SlideShare ist ein Scribd-Unternehmen logo
1 von 29
Downloaden Sie, um offline zu lesen
Universitat Politècnica de Catalunya
BARCELONA TECH
Madrid. March 30, 2017
Eduard Ayguadé
Dep. Arquitectura de Computadors, U. Politècnica de Catalunya
Computer Sciences Dep., Barcelona Supercomputing Center
OpenMP tasking model: from the
standard to the classroom
ICT386 multiprocessor architecture, late 80s
1
4 x CPU module
i386/i387
no private cache memory
4 x Memory module
4 MB DRAM (100 ns)
64 KB shared cache
25 ns, direct-mapped
ICT386 multiprocessor architecture, late 80s
2
Crossbar 5x5
Gate array design
10 ns transfer
Intellligent memory
Memory mapped
1 MB DRAM
Functions: management,
synchronization and
scheduling
ProgrammableNIC
And the wall has become higher and higher …
3
big.LITTLE Integrated GPU
Socket
S1 S2
GPU
G
D
D
R
GPU
G
D
D
R
D
D
R
D
D
R
FPGA
N
V
M
HCA
CAPIPCIe
PCIePCIe
QPI
Node
Rank Name Site Computer
Total
Cores
Accelerator/
Co-Proc.
Cores
Rmax (Gflops)
Rpeak
(Gflops)
Power (KW) Mflops/Wa
1
Sunway
TaihuLight
National
SupercomputingCenter
in Wuxi
Sunway MPP, Sunway
SW26010 260C 1.45GHz,
Sunway 10649600 0 93014593,88 125435904 15371 6051,
2
Tianhe-2
(MilkyWay-2)
National Super
Computer Center in
Guangzhou
TH-IVB-FEP Cluster, Intel
Xeon E5-269212C 2.200GHz,
TH Express-2, Intel Xeon Phi
31S1P 3120000 2736000 33862700 54902400 17808 1901,
3 Titan
DOE/SC/Oak Ridge
National Laboratory
Cray XK7, Opteron 6274 16C
2.200GHz, Cray Gemini
interconnect, NVIDIA K20x 560640 261632 17590000 27112550 8209 2142,
4 Sequoia DOE/NNSA/LLNL
BlueGene/Q,Power BQC 16C
1.60 GHz, Custom 1572864 0 17173224 20132659,2 7890 2176,
5 Cori DOE/SC/LBNL/NERSC
Cray XC40, Intel Xeon Phi
7250 68C 1.4GHz, Aries
interconnect 622336 0 14014700 27880653 3939 3557,
6
Oakforest-
PACS
Joint Center for
Advanced High
Performance Computing
PRIMERGY CX1640 M1, Intel
Xeon Phi 7250 68C 1.4GHz,
Intel Omni-Path 556104 0 13554600 24913459 2718 4985,
7 K computer
RIKEN Advanced
Institute for
Computational Science
(AICS)
K computer, SPARC64 VIIIfx
2.0GHz, Tofu interconnect 705024 0 10510000 11280384 12660 830,
8 Piz Daint
Swiss National
SupercomputingCentre
(CSCS)
Cray XC50, Xeon E5-2690v3
12C 2.6GHz, Aries
interconnect , NVIDIA Tesla
P100 206720 170240 9779000 15987968 1312 7453,
9 Mira
DOE/SC/Argonne
National Laboratory
BlueGene/Q,Power BQC 16C
1.60GHz, Custom 786432 0 8586612 10066330 3945 2176,
10 Trinity DOE/NNSA/LANL/SNL
Cray XC40, Xeon E5-2698v3
16C 2.3GHz, Aries
interconnect 301056 0 8100900 11078861 4232 1913,
Top5 November 2016 list
1979 Pink Floyd’s The Wall album,
Cray-1 delivering 80 Mflops, single processor
OpenMP evolution and
OmpSs influence
Runtime opportunities
Architecture opportunities
Teaching opportunities
Programmability
The vision
Plethora of architectures
• Heterogeneity
• Hierarchies
ISA Low-level interfaces
ISA/PM leaks and
variability/faults/errors
everywhere
• Programmer exposed
• Divergence between
mental models and
actual behavior
New application domains
Programming models
Applications
Programmability and
porting/inter-operability
• Legacy codes
• ”Legacy” mental models
and habits
APIs
New usage practices
• Analytics and visualization
• New data models
• Interactive supercomputing, response
time
• Virtualized data centers
• Cloud
Programming interfaces
Ana…
…lytics
9
The vision
10
Intelligent runtime,
parallelization,
distribution,
interoperability
General purpose
Task based
Single address
space
Program logic
independent of
computing platform
Performance-
and power-driven
optimization/tuning
Resource
management
ISA Low-level interfaces APIs
Runtime system Resource management
Performance, power, energy… monitoring/insight
Applications
Programming model: high-level, clean, abstract interface
7
The OmpSs forerunner for OpenMP
Proposing ideas …
– IWOMP workshop is the ideal scenario
– Participation in the OpenMP language committee and sub-committees
Demonstrated with …
– Prototype implementation: Mercurium compiler and Nanos runtime
– Benchmark/application suite
Lots of discussions and final polishing
3.02008
+ Tasks + Task
dependences
+ Task
priorities
+ Taskloop
prototyping
+ Task reductions
+ Dependences
on taskwait
+ Multidependences
+ Commutative
+ Dependences
on taskloop
Today
3.12011
4.02013
4.52015
5.0?
OmpSs
8
OmpSs influencing tasking model evolution (1)
The big change …
From loop-based model to task-based model
– Dynamically allocated data structures
(lists, trees, …)
– Recursion
task generation waits for child tasks to finish (3.0)
task-based model, nesting of tasks (3.0)
#pragma omp task firstprivate(list)
{ code block }
#pragma omp taskwait
task priorities (4.5)
priority(value)
task generation cut-off (3.1)
final(expr)
root
9
OmpSs influencing tasking model evolution (2)
Directionality of task arguments
Dependences between tasks computed
at runtime
Tasks executed as soon as all dependences
are satisfied
directionality for task arguments (4.0)
#pragma omp task depend(in: list, out: list, inout: list)
{ code block }
10
OmpSs influencing tasking model evolution (3)
Tasks become chunks of iterations of one or more associated
loops
Dependences between tasks generated in the same or
different taskloop construct under discussion
tasks out of loops and granularity control (4.5)
#pragma omp taskloop [num_tasks(value) | grainsize(value)]
for-loops
#pragma omp taskloop block(2) grainsize(B, B)
depend(in : block[i-1][j], block[i][j-1])
depend(in : block[i+1][j], block[i][j+1])
depend(out: block[i][j])
for (i=1; i<n i++)
for (j=1; j<n;j++)
foo(i, j);
T T T T
T T T T
T T T T
T T T T
11
OmpSs influencing tasking model evolution (4)
Reduction operations on task and taskloop not yet supported,
currently in discussion
#pragma omp taskgroup reduction(+: res)
{
while (node) {
#pragma omp task in_reduction(+: res)
res += node->value;
node = node->next;
}
}
#pragma omp taskloop grainsize(BS) reduction(+: res)
for (i = 0; i < N; ++i) {
res += foo(v[i]);
}
12
OmpSs influencing tasking model evolution (5)
The number of task dependences have to be constant at
compile time à multidependences proposal under discussion
for (i = 0; i < l.size(); ++i) {
#pragma omp task depend(inout: l[i]) // T1
{ ... }
}
#pragma omp task depend(in: {l[j], j=0:l.size()}) // T2
for (i = 0; i < l.size(); ++i) {
...
}
Defines an iterator with a range of values that only exists
in the context of the directive. Equivalent to:
depend(in: l[0], l[1], ..., l[l.size()-1])
T1 T1 T1 T1
T2
…
13
OmpSs influencing tasking model evolution (6)
New dependence type that relaxes the execution order of
commutative tasks but keeping mutual exclusion between
them
T1
T3
T2
T2
for (i = 0; i < l.size(); ++i) {
#pragma omp task depend(out: l[i])
// T1
}
for (i = 0; i < l.size(); ++i) {
#pragma omp task shared(x) depend(in: l[i])
depend(inout: x)
// T2
}
#pragma omp task shared(x) depend(in: x) // T3
for (i = 0; i < l.size(); ++i) {
...
}
T1
T1
T1
T2
T2
14
OmpSs influencing tasking model evolution (6)
New dependence type that relaxes the execution order of
commutative tasks but keeping mutual exclusion between
them
T1
T3
T2
T2
for (i = 0; i < l.size(); ++i) {
#pragma omp task depend(out: l[i])
// T1
}
for (i = 0; i < l.size(); ++i) {
#pragma omp task shared(x) depend(in: l[i])
depend(commutative: x)
// T2
}
#pragma omp task shared(x) depend(in: x) // T3
for (i = 0; i < l.size(); ++i) {
...
}
T1
T1
T1
T2
T2
Scheduling
New tasks
Dynamic task graph
Worker
threads
Execute
task
Nanos++runtime
…
newTask(call0, data0);
newTask(call1, data1);
…
waitTasks()
…
Application binary
Ready
task queue
End task: dependence update
15
OmpSs: runtime opportunities (1)
Runtime parallelization
– Compute data dependences among tasks à dynamic task graph
– Dataflow execution of tasks à ready task queue
– Look-ahead: task instantiation can proceed past non-ready tasks (with
opportunities to find distant parallelism)
16
OmpSs: runtime opportunities (2)
Criticallity-aware scheduler: managing heterogeneous SoC
– Tasks in the critical path detected at runtime à execute in faster cores
– Non-critical tasks executed in slower, more power-efficient cores
– Work-stealing for load balancing
“Criticality-Aware Dynamic Task Scheduling on Heterogeneous Architectures”.
K. Chronaki et al. ICS-2015
0,7
0,8
0,9
1
1,1
1,2
1,3
Cholesky Int. Hist QR Heat AVG
Speedup
Original CATS
Scheduling
New tasks
Dynamic task graph
Worker
threads
Execute
task
Nanos++runtime
…
newTask(call0, data0);
newTask(call1, data1);
…
waitTasks()
…
Application binary
Ready
task queue
End task: dependence update
17
OmpSs: runtime opportunities (3)
Runtime parallelization
Directory with per-core arguments utilization
– Locality-aware task scheduling strategies
– Driving argument prefetching
Directory
18
OmpSs: runtime opportunities (4)
Management of address spaces and coherence
– Directory@master: identifies for every region reference where data is
– Per-device software cache: translate host to device address spaces
– Use of device specific mechanisms to move data, if needed
– Locality-aware scheduling when multiple devices exist
Scheduling
Dynamic task graph
Nanos++runtime
Ready
task queue
End task: dependence update
Worker
threads
Helper
threads
Execute
task
MIC
threads
CUDA
threads
Data
requests
Directory / Caches
“Self-Adaptive OmpSs Tasks in Heterogeneous Environments”. J. Planas et al., IPDPS 2013.
Example: nbody
#pragma omp target device(smp) copy_deps
#pragma omp task depend(out: [size]out), in: [npart]part)
void comp_force (int size, float time, int npart, Part* part, Particle *out, int gid);
#pragma omp target device(fpga) ndrange(1,size,128) copy_deps implements (comp_force)
#pragma omp task depend(out: [size]out), in: [npart]part)
__kernel void comp_force_opencl (int size, float time, int npart, __global Part* part,
__global Part* out, int gid);
#pragma omp target device(cuda) ndrange(1,size,128) copy_deps implements (comp_force)
#pragma omp task depend(out: [size]out), in: [npart]part)
__global__ void comp_force_cuda (int size, float time, int npar, Part* part, Particle *out, int gid);
void Particle_array_calculate_forces(Particle* input, Particle *output, int npart, float time) {
for (int i = 0; i < npart; i += BS )
comp_force (BS, time, npart, input, &output[i], i);
}
20
OmpSs: runtime opportunities (5)
Support for multiple implementations for same task
– Versioning scheduler:
• Monitoring (learning) phase and scheduling phase
• Most suitable implementation is selected each time depending on
monitoring, simple model and OmpSs worker’s work load
SMP W1
SMP W2
GPU W3
Faster but
busy
Slower but
early executor
Run task!
“Self-Adaptive OmpSs Tasks in Heterogeneous Environments”. J. Planas et al., IPDPS 2013.
21
OmpSs: runtime opportunities (6)
At cluster …
– Task creation starts at master node, outermost tasks may be executed
on remote nodes and spawn inner tasks
– Locality-aware scheduler, overlap transfers/computation, work stealing
Scheduling
Directory/Cache
Nanos++runtime
masternode
Worker
threads
Helper
threads
Execute
task
Data
requests
Worker
threads
Helper
threads
Nanos++runtime
remotenode
data in host,
devices attached to node,
and remote nodes
data in host,
devices attached to node,
NIC NIC
“Productive Cluster Programming with OmpSs”. J. Bueno et al. EuroPar 2011.
“Productive Programming of GPU Clusters with OmpSs”. J. Bueno et al., IPDPS 2012.
22
OmpSs: architecture opportunities (1)
Novel hybrid cache/local memory
– Cache: irregular accesses
– Local memory (energy efficient, less
coherence traffic): strided accesses
Runtime-assisted prefetching
• Task inputs and outputs mapped to the LMs
• Overlap with runtime
• Double buffering between tasks
8.7% speedup, 14% reduction in power,
20% reduction in network-on-chip traffic
C C
L1
Cluster	Interconnect
LM
L1
LM
C C
L1
LM
L1
LM
C C
L1
LM
L1
LM
C C
L1
LM
L1
LM
DRAM DRAM
L2
L3
“Runtime-Guided Management of Scratchpad Memories in Multicores.” Ll. Alvarez et al. Ç
PACT-2015
OmpSs: architecture opportunities (3)
Task dependence manager to speedup task management and
dependence tracking
Hardware support for reductions
– Privatization of variables to speedup reduction à Significant overhead
when performing reductions on a vector or matrix block.
– HW support to perform simple reduction operations (+, -, *, max, min)
in the last-level cache or in memory à In-memory computation
X. Tan et al. Performance Analysis of a Hardware Accelerator of Dependency Management for Task-based Dataflow
Programming models. ISPASS 2016.
X. Tan et al. General Purpose Task-Dependence Management Hardware for Task-based Dataflow Programming
Models. IPDPS 2017.
J. Ciesko et al. Supporting Adaptive Privatization Techniques for Irregular Array Reductions in Task-parallel
Programming Models. IWOMP 2016.
24
OmpSs: architecture opportunities (2)
CATA: criticallity-aware task acceleration
– Runtime drives per-core DVFS reconfigurations meeting a global
power budget
– Hardware Runtime Support Unit (RSU) to minimize overheads that
grow with the number of cores
E. Castillo et al., CATA: Criticality Aware Task Acceleration for Multicore Processors. IPDPS 2016.
80%
90%
100%
110%
120%
130%
140%
150%
Performance imprv EDP imprv
Original CATS CATA CATA+RSU
32-core system
with 16 fast cores
25
Teaching opportunities (1)
Two courses in Computer Science BSc @ FIB (UPC)
– PAR – Parallelism: 5th semester, mandatory course
– PAP – Parallel Architectures and Programming: optional, computer
engineering specialization
OpenMP task-based programming @ node level
26
Teaching opportunities (2)
Syllabus
• Tasks, task graph, parallelism
• Overheads and model: task creation, synchronization, data sharing
• Task decomposition: linear, iterative, recursive
• Task ordering and data sharing constraints
• The magic of the architecture: cache coherency and synchronization
support
• Data decomposition and influence of memory
OpenMP tasks at its heart, going towards task-only parallel
programming
• Embarrassingly parallel, recursive
task decomposition, task/data
decomposition, and branch and bound
• Tareador tool for the exploration of
task decomposition strategies
P
PAR
Innovation:
(pseudo-) flipped-classroom model
27
Teaching opportunities (2)
PAP
Syllabus
• Compiler and runtime support to OpenMP
• Pthreads API and lib intrinsics
• Designing an HPC cluster: components
and interconnect, roofline model for flops/bw
• Message-passing interface MPI
OpenMP for node programming, with MPI towards
cluster programming
• OpenMP tasking review
• Implementing a runtime to support OpenMP
• Building a 4xOdroid cluster: hard and soft installation:
the effect of heterogeneity
• Programming the 4xOdroid cluster
with MPI: Stencil with Jacobi
Innovation: puzzle work in groups when designing cluster
28
Thanks!
https://www.bsc.es

Weitere ähnliche Inhalte

Was ist angesagt?

LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGATO project
 
Update on the Mont-Blanc Project for ARM-based HPC
Update on the Mont-Blanc Project for ARM-based HPCUpdate on the Mont-Blanc Project for ARM-based HPC
Update on the Mont-Blanc Project for ARM-based HPCinside-BigData.com
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Intel® Software
 
Tokyo Webmining Talk1
Tokyo Webmining Talk1Tokyo Webmining Talk1
Tokyo Webmining Talk1Kenta Oono
 
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big ComputingEuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big ComputingJonathan Dursi
 
PAKDD2016 Tutorial DLIF: Introduction and Basics
PAKDD2016 Tutorial DLIF: Introduction and BasicsPAKDD2016 Tutorial DLIF: Introduction and Basics
PAKDD2016 Tutorial DLIF: Introduction and BasicsAtsunori Kanemura
 
Serving BERT Models in Production with TorchServe
Serving BERT Models in Production with TorchServeServing BERT Models in Production with TorchServe
Serving BERT Models in Production with TorchServeNidhin Pattaniyil
 
"Big Data" Bioinformatics
"Big Data" Bioinformatics"Big Data" Bioinformatics
"Big Data" BioinformaticsBrian Repko
 
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...AMD Developer Central
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Databricks
 
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...Edge AI and Vision Alliance
 
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15MLconf
 
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...Intel® Software
 
Tech Days 2015: User Presentation Vermont Technical College
Tech Days 2015: User Presentation Vermont Technical CollegeTech Days 2015: User Presentation Vermont Technical College
Tech Days 2015: User Presentation Vermont Technical CollegeAdaCore
 
IS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorialIS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorialRoger Rafanell Mas
 
Biomedical Signal and Image Analytics using MATLAB
Biomedical Signal and Image Analytics using MATLABBiomedical Signal and Image Analytics using MATLAB
Biomedical Signal and Image Analytics using MATLABCodeOps Technologies LLP
 
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...Jason Hearne-McGuiness
 
Solve it Differently with Reactive Programming
Solve it Differently with Reactive ProgrammingSolve it Differently with Reactive Programming
Solve it Differently with Reactive ProgrammingSupun Dissanayake
 
Scalable and Distributed DNN Training on Modern HPC Systems
Scalable and Distributed DNN Training on Modern HPC SystemsScalable and Distributed DNN Training on Modern HPC Systems
Scalable and Distributed DNN Training on Modern HPC Systemsinside-BigData.com
 
Task Parallel Library Data Flows
Task Parallel Library Data FlowsTask Parallel Library Data Flows
Task Parallel Library Data FlowsSANKARSAN BOSE
 

Was ist angesagt? (20)

LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack Runtimes
 
Update on the Mont-Blanc Project for ARM-based HPC
Update on the Mont-Blanc Project for ARM-based HPCUpdate on the Mont-Blanc Project for ARM-based HPC
Update on the Mont-Blanc Project for ARM-based HPC
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
 
Tokyo Webmining Talk1
Tokyo Webmining Talk1Tokyo Webmining Talk1
Tokyo Webmining Talk1
 
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big ComputingEuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
 
PAKDD2016 Tutorial DLIF: Introduction and Basics
PAKDD2016 Tutorial DLIF: Introduction and BasicsPAKDD2016 Tutorial DLIF: Introduction and Basics
PAKDD2016 Tutorial DLIF: Introduction and Basics
 
Serving BERT Models in Production with TorchServe
Serving BERT Models in Production with TorchServeServing BERT Models in Production with TorchServe
Serving BERT Models in Production with TorchServe
 
"Big Data" Bioinformatics
"Big Data" Bioinformatics"Big Data" Bioinformatics
"Big Data" Bioinformatics
 
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
 
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
 
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
 
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
 
Tech Days 2015: User Presentation Vermont Technical College
Tech Days 2015: User Presentation Vermont Technical CollegeTech Days 2015: User Presentation Vermont Technical College
Tech Days 2015: User Presentation Vermont Technical College
 
IS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorialIS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorial
 
Biomedical Signal and Image Analytics using MATLAB
Biomedical Signal and Image Analytics using MATLABBiomedical Signal and Image Analytics using MATLAB
Biomedical Signal and Image Analytics using MATLAB
 
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
 
Solve it Differently with Reactive Programming
Solve it Differently with Reactive ProgrammingSolve it Differently with Reactive Programming
Solve it Differently with Reactive Programming
 
Scalable and Distributed DNN Training on Modern HPC Systems
Scalable and Distributed DNN Training on Modern HPC SystemsScalable and Distributed DNN Training on Modern HPC Systems
Scalable and Distributed DNN Training on Modern HPC Systems
 
Task Parallel Library Data Flows
Task Parallel Library Data FlowsTask Parallel Library Data Flows
Task Parallel Library Data Flows
 

Ähnlich wie OpenMP tasking model: from the standard to the classroom

A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Ahsan Javed Awan
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetupGanesan Narayanasamy
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...Reynold Xin
 
OmpSs – improving the scalability of OpenMP
OmpSs – improving the scalability of OpenMPOmpSs – improving the scalability of OpenMP
OmpSs – improving the scalability of OpenMPIntel IT Center
 
Near Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache SparkNear Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache SparkAhsan Javed Awan
 
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Spark Summit
 
NumPy Roadmap presentation at NumFOCUS Forum
NumPy Roadmap presentation at NumFOCUS ForumNumPy Roadmap presentation at NumFOCUS Forum
NumPy Roadmap presentation at NumFOCUS ForumRalf Gommers
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with SparkRoger Rafanell Mas
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopHortonworks
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Swift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance WorkflowSwift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance WorkflowDaniel S. Katz
 

Ähnlich wie OpenMP tasking model: from the standard to the classroom (20)

A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
 
EEDC Programming Models
EEDC Programming ModelsEEDC Programming Models
EEDC Programming Models
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
OmpSs – improving the scalability of OpenMP
OmpSs – improving the scalability of OpenMPOmpSs – improving the scalability of OpenMP
OmpSs – improving the scalability of OpenMP
 
Near Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache SparkNear Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache Spark
 
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
 
NumPy Roadmap presentation at NumFOCUS Forum
NumPy Roadmap presentation at NumFOCUS ForumNumPy Roadmap presentation at NumFOCUS Forum
NumPy Roadmap presentation at NumFOCUS Forum
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Swift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance WorkflowSwift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance Workflow
 
System mldl meetup
System mldl meetupSystem mldl meetup
System mldl meetup
 

Mehr von Facultad de Informática UCM

¿Por qué debemos seguir trabajando en álgebra lineal?
¿Por qué debemos seguir trabajando en álgebra lineal?¿Por qué debemos seguir trabajando en álgebra lineal?
¿Por qué debemos seguir trabajando en álgebra lineal?Facultad de Informática UCM
 
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...Facultad de Informática UCM
 
DRAC: Designing RISC-V-based Accelerators for next generation Computers
DRAC: Designing RISC-V-based Accelerators for next generation ComputersDRAC: Designing RISC-V-based Accelerators for next generation Computers
DRAC: Designing RISC-V-based Accelerators for next generation ComputersFacultad de Informática UCM
 
Tendencias en el diseño de procesadores con arquitectura Arm
Tendencias en el diseño de procesadores con arquitectura ArmTendencias en el diseño de procesadores con arquitectura Arm
Tendencias en el diseño de procesadores con arquitectura ArmFacultad de Informática UCM
 
Introduction to Quantum Computing and Quantum Service Oriented Computing
Introduction to Quantum Computing and Quantum Service Oriented ComputingIntroduction to Quantum Computing and Quantum Service Oriented Computing
Introduction to Quantum Computing and Quantum Service Oriented ComputingFacultad de Informática UCM
 
Inteligencia Artificial en la atención sanitaria del futuro
Inteligencia Artificial en la atención sanitaria del futuroInteligencia Artificial en la atención sanitaria del futuro
Inteligencia Artificial en la atención sanitaria del futuroFacultad de Informática UCM
 
Design Automation Approaches for Real-Time Edge Computing for Science Applic...
 Design Automation Approaches for Real-Time Edge Computing for Science Applic... Design Automation Approaches for Real-Time Edge Computing for Science Applic...
Design Automation Approaches for Real-Time Edge Computing for Science Applic...Facultad de Informática UCM
 
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...Estrategias de navegación para robótica móvil de campo: caso de estudio proye...
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...Facultad de Informática UCM
 
Fault-tolerance Quantum computation and Quantum Error Correction
Fault-tolerance Quantum computation and Quantum Error CorrectionFault-tolerance Quantum computation and Quantum Error Correction
Fault-tolerance Quantum computation and Quantum Error CorrectionFacultad de Informática UCM
 
Cómo construir un chatbot inteligente sin morir en el intento
Cómo construir un chatbot inteligente sin morir en el intentoCómo construir un chatbot inteligente sin morir en el intento
Cómo construir un chatbot inteligente sin morir en el intentoFacultad de Informática UCM
 
Automatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPCAutomatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPCFacultad de Informática UCM
 
Hardware/software security contracts: Principled foundations for building sec...
Hardware/software security contracts: Principled foundations for building sec...Hardware/software security contracts: Principled foundations for building sec...
Hardware/software security contracts: Principled foundations for building sec...Facultad de Informática UCM
 
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...Facultad de Informática UCM
 
Redes neuronales y reinforcement learning. Aplicación en energía eólica.
Redes neuronales y reinforcement learning. Aplicación en energía eólica.Redes neuronales y reinforcement learning. Aplicación en energía eólica.
Redes neuronales y reinforcement learning. Aplicación en energía eólica.Facultad de Informática UCM
 
Challenges and Opportunities for AI and Data analytics in Offshore wind
Challenges and Opportunities for AI and Data analytics in Offshore windChallenges and Opportunities for AI and Data analytics in Offshore wind
Challenges and Opportunities for AI and Data analytics in Offshore windFacultad de Informática UCM
 

Mehr von Facultad de Informática UCM (20)

¿Por qué debemos seguir trabajando en álgebra lineal?
¿Por qué debemos seguir trabajando en álgebra lineal?¿Por qué debemos seguir trabajando en álgebra lineal?
¿Por qué debemos seguir trabajando en álgebra lineal?
 
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...
 
DRAC: Designing RISC-V-based Accelerators for next generation Computers
DRAC: Designing RISC-V-based Accelerators for next generation ComputersDRAC: Designing RISC-V-based Accelerators for next generation Computers
DRAC: Designing RISC-V-based Accelerators for next generation Computers
 
uElectronics ongoing activities at ESA
uElectronics ongoing activities at ESAuElectronics ongoing activities at ESA
uElectronics ongoing activities at ESA
 
Tendencias en el diseño de procesadores con arquitectura Arm
Tendencias en el diseño de procesadores con arquitectura ArmTendencias en el diseño de procesadores con arquitectura Arm
Tendencias en el diseño de procesadores con arquitectura Arm
 
Formalizing Mathematics in Lean
Formalizing Mathematics in LeanFormalizing Mathematics in Lean
Formalizing Mathematics in Lean
 
Introduction to Quantum Computing and Quantum Service Oriented Computing
Introduction to Quantum Computing and Quantum Service Oriented ComputingIntroduction to Quantum Computing and Quantum Service Oriented Computing
Introduction to Quantum Computing and Quantum Service Oriented Computing
 
Computer Design Concepts for Machine Learning
Computer Design Concepts for Machine LearningComputer Design Concepts for Machine Learning
Computer Design Concepts for Machine Learning
 
Inteligencia Artificial en la atención sanitaria del futuro
Inteligencia Artificial en la atención sanitaria del futuroInteligencia Artificial en la atención sanitaria del futuro
Inteligencia Artificial en la atención sanitaria del futuro
 
Design Automation Approaches for Real-Time Edge Computing for Science Applic...
 Design Automation Approaches for Real-Time Edge Computing for Science Applic... Design Automation Approaches for Real-Time Edge Computing for Science Applic...
Design Automation Approaches for Real-Time Edge Computing for Science Applic...
 
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...Estrategias de navegación para robótica móvil de campo: caso de estudio proye...
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...
 
Fault-tolerance Quantum computation and Quantum Error Correction
Fault-tolerance Quantum computation and Quantum Error CorrectionFault-tolerance Quantum computation and Quantum Error Correction
Fault-tolerance Quantum computation and Quantum Error Correction
 
Cómo construir un chatbot inteligente sin morir en el intento
Cómo construir un chatbot inteligente sin morir en el intentoCómo construir un chatbot inteligente sin morir en el intento
Cómo construir un chatbot inteligente sin morir en el intento
 
Automatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPCAutomatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPC
 
Type and proof structures for concurrency
Type and proof structures for concurrencyType and proof structures for concurrency
Type and proof structures for concurrency
 
Hardware/software security contracts: Principled foundations for building sec...
Hardware/software security contracts: Principled foundations for building sec...Hardware/software security contracts: Principled foundations for building sec...
Hardware/software security contracts: Principled foundations for building sec...
 
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...
 
Do you trust your artificial intelligence system?
Do you trust your artificial intelligence system?Do you trust your artificial intelligence system?
Do you trust your artificial intelligence system?
 
Redes neuronales y reinforcement learning. Aplicación en energía eólica.
Redes neuronales y reinforcement learning. Aplicación en energía eólica.Redes neuronales y reinforcement learning. Aplicación en energía eólica.
Redes neuronales y reinforcement learning. Aplicación en energía eólica.
 
Challenges and Opportunities for AI and Data analytics in Offshore wind
Challenges and Opportunities for AI and Data analytics in Offshore windChallenges and Opportunities for AI and Data analytics in Offshore wind
Challenges and Opportunities for AI and Data analytics in Offshore wind
 

Kürzlich hochgeladen

Gender Bias in Engineer, Honors 203 Project
Gender Bias in Engineer, Honors 203 ProjectGender Bias in Engineer, Honors 203 Project
Gender Bias in Engineer, Honors 203 Projectreemakb03
 
ASME BPVC 2023 Section I para leer y entender
ASME BPVC 2023 Section I para leer y entenderASME BPVC 2023 Section I para leer y entender
ASME BPVC 2023 Section I para leer y entenderjuancarlos286641
 
Engineering Mechanics Chapter 5 Equilibrium of a Rigid Body
Engineering Mechanics  Chapter 5  Equilibrium of a Rigid BodyEngineering Mechanics  Chapter 5  Equilibrium of a Rigid Body
Engineering Mechanics Chapter 5 Equilibrium of a Rigid BodyAhmadHajasad2
 
cloud computing notes for anna university syllabus
cloud computing notes for anna university syllabuscloud computing notes for anna university syllabus
cloud computing notes for anna university syllabusViolet Violet
 
ChatGPT-and-Generative-AI-Landscape Working of generative ai search
ChatGPT-and-Generative-AI-Landscape Working of generative ai searchChatGPT-and-Generative-AI-Landscape Working of generative ai search
ChatGPT-and-Generative-AI-Landscape Working of generative ai searchrohitcse52
 
me3493 manufacturing technology unit 1 Part A
me3493 manufacturing technology unit 1 Part Ame3493 manufacturing technology unit 1 Part A
me3493 manufacturing technology unit 1 Part Akarthi keyan
 
Dev.bg DevOps March 2024 Monitoring & Logging
Dev.bg DevOps March 2024 Monitoring & LoggingDev.bg DevOps March 2024 Monitoring & Logging
Dev.bg DevOps March 2024 Monitoring & LoggingMarian Marinov
 
nvidia AI-gtc 2024 partial slide deck.pptx
nvidia AI-gtc 2024 partial slide deck.pptxnvidia AI-gtc 2024 partial slide deck.pptx
nvidia AI-gtc 2024 partial slide deck.pptxjasonsedano2
 
How to Write a Good Scientific Paper.pdf
How to Write a Good Scientific Paper.pdfHow to Write a Good Scientific Paper.pdf
How to Write a Good Scientific Paper.pdfRedhwan Qasem Shaddad
 
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...Sean Meyn
 
Clutches and brkesSelect any 3 position random motion out of real world and d...
Clutches and brkesSelect any 3 position random motion out of real world and d...Clutches and brkesSelect any 3 position random motion out of real world and d...
Clutches and brkesSelect any 3 position random motion out of real world and d...sahb78428
 
Graphics Primitives and CG Display Devices
Graphics Primitives and CG Display DevicesGraphics Primitives and CG Display Devices
Graphics Primitives and CG Display DevicesDIPIKA83
 
Lecture 1: Basics of trigonometry (surveying)
Lecture 1: Basics of trigonometry (surveying)Lecture 1: Basics of trigonometry (surveying)
Lecture 1: Basics of trigonometry (surveying)Bahzad5
 
Best-NO1 Best Rohani Amil In Lahore Kala Ilam In Lahore Kala Jadu Amil In Lah...
Best-NO1 Best Rohani Amil In Lahore Kala Ilam In Lahore Kala Jadu Amil In Lah...Best-NO1 Best Rohani Amil In Lahore Kala Ilam In Lahore Kala Jadu Amil In Lah...
Best-NO1 Best Rohani Amil In Lahore Kala Ilam In Lahore Kala Jadu Amil In Lah...Amil baba
 
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...amrabdallah9
 
Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...
Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...
Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...Amil baba
 

Kürzlich hochgeladen (20)

Gender Bias in Engineer, Honors 203 Project
Gender Bias in Engineer, Honors 203 ProjectGender Bias in Engineer, Honors 203 Project
Gender Bias in Engineer, Honors 203 Project
 
ASME BPVC 2023 Section I para leer y entender
ASME BPVC 2023 Section I para leer y entenderASME BPVC 2023 Section I para leer y entender
ASME BPVC 2023 Section I para leer y entender
 
Litature Review: Research Paper work for Engineering
Litature Review: Research Paper work for EngineeringLitature Review: Research Paper work for Engineering
Litature Review: Research Paper work for Engineering
 
Engineering Mechanics Chapter 5 Equilibrium of a Rigid Body
Engineering Mechanics  Chapter 5  Equilibrium of a Rigid BodyEngineering Mechanics  Chapter 5  Equilibrium of a Rigid Body
Engineering Mechanics Chapter 5 Equilibrium of a Rigid Body
 
cloud computing notes for anna university syllabus
cloud computing notes for anna university syllabuscloud computing notes for anna university syllabus
cloud computing notes for anna university syllabus
 
ChatGPT-and-Generative-AI-Landscape Working of generative ai search
ChatGPT-and-Generative-AI-Landscape Working of generative ai searchChatGPT-and-Generative-AI-Landscape Working of generative ai search
ChatGPT-and-Generative-AI-Landscape Working of generative ai search
 
Présentation IIRB 2024 Chloe Dufrane.pdf
Présentation IIRB 2024 Chloe Dufrane.pdfPrésentation IIRB 2024 Chloe Dufrane.pdf
Présentation IIRB 2024 Chloe Dufrane.pdf
 
me3493 manufacturing technology unit 1 Part A
me3493 manufacturing technology unit 1 Part Ame3493 manufacturing technology unit 1 Part A
me3493 manufacturing technology unit 1 Part A
 
Dev.bg DevOps March 2024 Monitoring & Logging
Dev.bg DevOps March 2024 Monitoring & LoggingDev.bg DevOps March 2024 Monitoring & Logging
Dev.bg DevOps March 2024 Monitoring & Logging
 
nvidia AI-gtc 2024 partial slide deck.pptx
nvidia AI-gtc 2024 partial slide deck.pptxnvidia AI-gtc 2024 partial slide deck.pptx
nvidia AI-gtc 2024 partial slide deck.pptx
 
How to Write a Good Scientific Paper.pdf
How to Write a Good Scientific Paper.pdfHow to Write a Good Scientific Paper.pdf
How to Write a Good Scientific Paper.pdf
 
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
 
Lecture 2 .pptx
Lecture 2                            .pptxLecture 2                            .pptx
Lecture 2 .pptx
 
Clutches and brkesSelect any 3 position random motion out of real world and d...
Clutches and brkesSelect any 3 position random motion out of real world and d...Clutches and brkesSelect any 3 position random motion out of real world and d...
Clutches and brkesSelect any 3 position random motion out of real world and d...
 
Présentation IIRB 2024 Marine Cordonnier.pdf
Présentation IIRB 2024 Marine Cordonnier.pdfPrésentation IIRB 2024 Marine Cordonnier.pdf
Présentation IIRB 2024 Marine Cordonnier.pdf
 
Graphics Primitives and CG Display Devices
Graphics Primitives and CG Display DevicesGraphics Primitives and CG Display Devices
Graphics Primitives and CG Display Devices
 
Lecture 1: Basics of trigonometry (surveying)
Lecture 1: Basics of trigonometry (surveying)Lecture 1: Basics of trigonometry (surveying)
Lecture 1: Basics of trigonometry (surveying)
 
Best-NO1 Best Rohani Amil In Lahore Kala Ilam In Lahore Kala Jadu Amil In Lah...
Best-NO1 Best Rohani Amil In Lahore Kala Ilam In Lahore Kala Jadu Amil In Lah...Best-NO1 Best Rohani Amil In Lahore Kala Ilam In Lahore Kala Jadu Amil In Lah...
Best-NO1 Best Rohani Amil In Lahore Kala Ilam In Lahore Kala Jadu Amil In Lah...
 
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...
 
Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...
Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...
Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...
 

OpenMP tasking model: from the standard to the classroom

  • 1. Universitat Politècnica de Catalunya BARCELONA TECH Madrid. March 30, 2017 Eduard Ayguadé Dep. Arquitectura de Computadors, U. Politècnica de Catalunya Computer Sciences Dep., Barcelona Supercomputing Center OpenMP tasking model: from the standard to the classroom
  • 2. ICT386 multiprocessor architecture, late 80s 1 4 x CPU module i386/i387 no private cache memory 4 x Memory module 4 MB DRAM (100 ns) 64 KB shared cache 25 ns, direct-mapped
  • 3. ICT386 multiprocessor architecture, late 80s 2 Crossbar 5x5 Gate array design 10 ns transfer Intellligent memory Memory mapped 1 MB DRAM Functions: management, synchronization and scheduling ProgrammableNIC
  • 4. And the wall has become higher and higher … 3 big.LITTLE Integrated GPU Socket S1 S2 GPU G D D R GPU G D D R D D R D D R FPGA N V M HCA CAPIPCIe PCIePCIe QPI Node Rank Name Site Computer Total Cores Accelerator/ Co-Proc. Cores Rmax (Gflops) Rpeak (Gflops) Power (KW) Mflops/Wa 1 Sunway TaihuLight National SupercomputingCenter in Wuxi Sunway MPP, Sunway SW26010 260C 1.45GHz, Sunway 10649600 0 93014593,88 125435904 15371 6051, 2 Tianhe-2 (MilkyWay-2) National Super Computer Center in Guangzhou TH-IVB-FEP Cluster, Intel Xeon E5-269212C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P 3120000 2736000 33862700 54902400 17808 1901, 3 Titan DOE/SC/Oak Ridge National Laboratory Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x 560640 261632 17590000 27112550 8209 2142, 4 Sequoia DOE/NNSA/LLNL BlueGene/Q,Power BQC 16C 1.60 GHz, Custom 1572864 0 17173224 20132659,2 7890 2176, 5 Cori DOE/SC/LBNL/NERSC Cray XC40, Intel Xeon Phi 7250 68C 1.4GHz, Aries interconnect 622336 0 14014700 27880653 3939 3557, 6 Oakforest- PACS Joint Center for Advanced High Performance Computing PRIMERGY CX1640 M1, Intel Xeon Phi 7250 68C 1.4GHz, Intel Omni-Path 556104 0 13554600 24913459 2718 4985, 7 K computer RIKEN Advanced Institute for Computational Science (AICS) K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect 705024 0 10510000 11280384 12660 830, 8 Piz Daint Swiss National SupercomputingCentre (CSCS) Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect , NVIDIA Tesla P100 206720 170240 9779000 15987968 1312 7453, 9 Mira DOE/SC/Argonne National Laboratory BlueGene/Q,Power BQC 16C 1.60GHz, Custom 786432 0 8586612 10066330 3945 2176, 10 Trinity DOE/NNSA/LANL/SNL Cray XC40, Xeon E5-2698v3 16C 2.3GHz, Aries interconnect 301056 0 8100900 11078861 4232 1913, Top5 November 2016 list
  • 5. 1979 Pink Floyd’s The Wall album, Cray-1 delivering 80 Mflops, single processor OpenMP evolution and OmpSs influence Runtime opportunities Architecture opportunities Teaching opportunities Programmability
  • 6. The vision Plethora of architectures • Heterogeneity • Hierarchies ISA Low-level interfaces ISA/PM leaks and variability/faults/errors everywhere • Programmer exposed • Divergence between mental models and actual behavior New application domains Programming models Applications Programmability and porting/inter-operability • Legacy codes • ”Legacy” mental models and habits APIs New usage practices • Analytics and visualization • New data models • Interactive supercomputing, response time • Virtualized data centers • Cloud Programming interfaces Ana… …lytics 9
  • 7. The vision 10 Intelligent runtime, parallelization, distribution, interoperability General purpose Task based Single address space Program logic independent of computing platform Performance- and power-driven optimization/tuning Resource management ISA Low-level interfaces APIs Runtime system Resource management Performance, power, energy… monitoring/insight Applications Programming model: high-level, clean, abstract interface
  • 8. 7 The OmpSs forerunner for OpenMP Proposing ideas … – IWOMP workshop is the ideal scenario – Participation in the OpenMP language committee and sub-committees Demonstrated with … – Prototype implementation: Mercurium compiler and Nanos runtime – Benchmark/application suite Lots of discussions and final polishing 3.02008 + Tasks + Task dependences + Task priorities + Taskloop prototyping + Task reductions + Dependences on taskwait + Multidependences + Commutative + Dependences on taskloop Today 3.12011 4.02013 4.52015 5.0? OmpSs
  • 9. 8 OmpSs influencing tasking model evolution (1) The big change … From loop-based model to task-based model – Dynamically allocated data structures (lists, trees, …) – Recursion task generation waits for child tasks to finish (3.0) task-based model, nesting of tasks (3.0) #pragma omp task firstprivate(list) { code block } #pragma omp taskwait task priorities (4.5) priority(value) task generation cut-off (3.1) final(expr) root
  • 10. 9 OmpSs influencing tasking model evolution (2) Directionality of task arguments Dependences between tasks computed at runtime Tasks executed as soon as all dependences are satisfied directionality for task arguments (4.0) #pragma omp task depend(in: list, out: list, inout: list) { code block }
  • 11. 10 OmpSs influencing tasking model evolution (3) Tasks become chunks of iterations of one or more associated loops Dependences between tasks generated in the same or different taskloop construct under discussion tasks out of loops and granularity control (4.5) #pragma omp taskloop [num_tasks(value) | grainsize(value)] for-loops #pragma omp taskloop block(2) grainsize(B, B) depend(in : block[i-1][j], block[i][j-1]) depend(in : block[i+1][j], block[i][j+1]) depend(out: block[i][j]) for (i=1; i<n i++) for (j=1; j<n;j++) foo(i, j); T T T T T T T T T T T T T T T T
  • 12. 11 OmpSs influencing tasking model evolution (4) Reduction operations on task and taskloop not yet supported, currently in discussion #pragma omp taskgroup reduction(+: res) { while (node) { #pragma omp task in_reduction(+: res) res += node->value; node = node->next; } } #pragma omp taskloop grainsize(BS) reduction(+: res) for (i = 0; i < N; ++i) { res += foo(v[i]); }
  • 13. 12 OmpSs influencing tasking model evolution (5) The number of task dependences have to be constant at compile time à multidependences proposal under discussion for (i = 0; i < l.size(); ++i) { #pragma omp task depend(inout: l[i]) // T1 { ... } } #pragma omp task depend(in: {l[j], j=0:l.size()}) // T2 for (i = 0; i < l.size(); ++i) { ... } Defines an iterator with a range of values that only exists in the context of the directive. Equivalent to: depend(in: l[0], l[1], ..., l[l.size()-1]) T1 T1 T1 T1 T2 …
  • 14. 13 OmpSs influencing tasking model evolution (6) New dependence type that relaxes the execution order of commutative tasks but keeping mutual exclusion between them T1 T3 T2 T2 for (i = 0; i < l.size(); ++i) { #pragma omp task depend(out: l[i]) // T1 } for (i = 0; i < l.size(); ++i) { #pragma omp task shared(x) depend(in: l[i]) depend(inout: x) // T2 } #pragma omp task shared(x) depend(in: x) // T3 for (i = 0; i < l.size(); ++i) { ... } T1 T1 T1 T2 T2
  • 15. 14 OmpSs influencing tasking model evolution (6) New dependence type that relaxes the execution order of commutative tasks but keeping mutual exclusion between them T1 T3 T2 T2 for (i = 0; i < l.size(); ++i) { #pragma omp task depend(out: l[i]) // T1 } for (i = 0; i < l.size(); ++i) { #pragma omp task shared(x) depend(in: l[i]) depend(commutative: x) // T2 } #pragma omp task shared(x) depend(in: x) // T3 for (i = 0; i < l.size(); ++i) { ... } T1 T1 T1 T2 T2
  • 16. Scheduling New tasks Dynamic task graph Worker threads Execute task Nanos++runtime … newTask(call0, data0); newTask(call1, data1); … waitTasks() … Application binary Ready task queue End task: dependence update 15 OmpSs: runtime opportunities (1) Runtime parallelization – Compute data dependences among tasks à dynamic task graph – Dataflow execution of tasks à ready task queue – Look-ahead: task instantiation can proceed past non-ready tasks (with opportunities to find distant parallelism)
  • 17. 16 OmpSs: runtime opportunities (2) Criticallity-aware scheduler: managing heterogeneous SoC – Tasks in the critical path detected at runtime à execute in faster cores – Non-critical tasks executed in slower, more power-efficient cores – Work-stealing for load balancing “Criticality-Aware Dynamic Task Scheduling on Heterogeneous Architectures”. K. Chronaki et al. ICS-2015 0,7 0,8 0,9 1 1,1 1,2 1,3 Cholesky Int. Hist QR Heat AVG Speedup Original CATS
  • 18. Scheduling New tasks Dynamic task graph Worker threads Execute task Nanos++runtime … newTask(call0, data0); newTask(call1, data1); … waitTasks() … Application binary Ready task queue End task: dependence update 17 OmpSs: runtime opportunities (3) Runtime parallelization Directory with per-core arguments utilization – Locality-aware task scheduling strategies – Driving argument prefetching Directory
  • 19. 18 OmpSs: runtime opportunities (4) Management of address spaces and coherence – Directory@master: identifies for every region reference where data is – Per-device software cache: translate host to device address spaces – Use of device specific mechanisms to move data, if needed – Locality-aware scheduling when multiple devices exist Scheduling Dynamic task graph Nanos++runtime Ready task queue End task: dependence update Worker threads Helper threads Execute task MIC threads CUDA threads Data requests Directory / Caches “Self-Adaptive OmpSs Tasks in Heterogeneous Environments”. J. Planas et al., IPDPS 2013.
  • 20. Example: nbody #pragma omp target device(smp) copy_deps #pragma omp task depend(out: [size]out), in: [npart]part) void comp_force (int size, float time, int npart, Part* part, Particle *out, int gid); #pragma omp target device(fpga) ndrange(1,size,128) copy_deps implements (comp_force) #pragma omp task depend(out: [size]out), in: [npart]part) __kernel void comp_force_opencl (int size, float time, int npart, __global Part* part, __global Part* out, int gid); #pragma omp target device(cuda) ndrange(1,size,128) copy_deps implements (comp_force) #pragma omp task depend(out: [size]out), in: [npart]part) __global__ void comp_force_cuda (int size, float time, int npar, Part* part, Particle *out, int gid); void Particle_array_calculate_forces(Particle* input, Particle *output, int npart, float time) { for (int i = 0; i < npart; i += BS ) comp_force (BS, time, npart, input, &output[i], i); }
  • 21. 20 OmpSs: runtime opportunities (5) Support for multiple implementations for same task – Versioning scheduler: • Monitoring (learning) phase and scheduling phase • Most suitable implementation is selected each time depending on monitoring, simple model and OmpSs worker’s work load SMP W1 SMP W2 GPU W3 Faster but busy Slower but early executor Run task! “Self-Adaptive OmpSs Tasks in Heterogeneous Environments”. J. Planas et al., IPDPS 2013.
  • 22. 21 OmpSs: runtime opportunities (6) At cluster … – Task creation starts at master node, outermost tasks may be executed on remote nodes and spawn inner tasks – Locality-aware scheduler, overlap transfers/computation, work stealing Scheduling Directory/Cache Nanos++runtime masternode Worker threads Helper threads Execute task Data requests Worker threads Helper threads Nanos++runtime remotenode data in host, devices attached to node, and remote nodes data in host, devices attached to node, NIC NIC “Productive Cluster Programming with OmpSs”. J. Bueno et al. EuroPar 2011. “Productive Programming of GPU Clusters with OmpSs”. J. Bueno et al., IPDPS 2012.
  • 23. 22 OmpSs: architecture opportunities (1) Novel hybrid cache/local memory – Cache: irregular accesses – Local memory (energy efficient, less coherence traffic): strided accesses Runtime-assisted prefetching • Task inputs and outputs mapped to the LMs • Overlap with runtime • Double buffering between tasks 8.7% speedup, 14% reduction in power, 20% reduction in network-on-chip traffic C C L1 Cluster Interconnect LM L1 LM C C L1 LM L1 LM C C L1 LM L1 LM C C L1 LM L1 LM DRAM DRAM L2 L3 “Runtime-Guided Management of Scratchpad Memories in Multicores.” Ll. Alvarez et al. Ç PACT-2015
  • 24. OmpSs: architecture opportunities (3) Task dependence manager to speedup task management and dependence tracking Hardware support for reductions – Privatization of variables to speedup reduction à Significant overhead when performing reductions on a vector or matrix block. – HW support to perform simple reduction operations (+, -, *, max, min) in the last-level cache or in memory à In-memory computation X. Tan et al. Performance Analysis of a Hardware Accelerator of Dependency Management for Task-based Dataflow Programming models. ISPASS 2016. X. Tan et al. General Purpose Task-Dependence Management Hardware for Task-based Dataflow Programming Models. IPDPS 2017. J. Ciesko et al. Supporting Adaptive Privatization Techniques for Irregular Array Reductions in Task-parallel Programming Models. IWOMP 2016.
  • 25. 24 OmpSs: architecture opportunities (2) CATA: criticallity-aware task acceleration – Runtime drives per-core DVFS reconfigurations meeting a global power budget – Hardware Runtime Support Unit (RSU) to minimize overheads that grow with the number of cores E. Castillo et al., CATA: Criticality Aware Task Acceleration for Multicore Processors. IPDPS 2016. 80% 90% 100% 110% 120% 130% 140% 150% Performance imprv EDP imprv Original CATS CATA CATA+RSU 32-core system with 16 fast cores
  • 26. 25 Teaching opportunities (1) Two courses in Computer Science BSc @ FIB (UPC) – PAR – Parallelism: 5th semester, mandatory course – PAP – Parallel Architectures and Programming: optional, computer engineering specialization OpenMP task-based programming @ node level
  • 27. 26 Teaching opportunities (2) Syllabus • Tasks, task graph, parallelism • Overheads and model: task creation, synchronization, data sharing • Task decomposition: linear, iterative, recursive • Task ordering and data sharing constraints • The magic of the architecture: cache coherency and synchronization support • Data decomposition and influence of memory OpenMP tasks at its heart, going towards task-only parallel programming • Embarrassingly parallel, recursive task decomposition, task/data decomposition, and branch and bound • Tareador tool for the exploration of task decomposition strategies P PAR Innovation: (pseudo-) flipped-classroom model
  • 28. 27 Teaching opportunities (2) PAP Syllabus • Compiler and runtime support to OpenMP • Pthreads API and lib intrinsics • Designing an HPC cluster: components and interconnect, roofline model for flops/bw • Message-passing interface MPI OpenMP for node programming, with MPI towards cluster programming • OpenMP tasking review • Implementing a runtime to support OpenMP • Building a 4xOdroid cluster: hard and soft installation: the effect of heterogeneity • Programming the 4xOdroid cluster with MPI: Stencil with Jacobi Innovation: puzzle work in groups when designing cluster