Introduction and Motivation Parallel Programming Models Machine Learning Techniques Comparison
A Comparison of GPU Execution Time Prediction using
Machine Learning and Analytical Modeling
Ph.D(c) CS Marcos Amarís González
Advisor: Dr. Alfredo Goldman vel Lejbman
Co-advisor: Dr. Raphael Yokoingawa de Camargo
December, 2016
(gold, amaris)@ime.usp.br (IME - USP) BSP-based model Vs. Machine Learning December, 2016 1 / 32
Timeline
1 Introduction and Motivation
2 Parallel Programming Models
BSP-based Analytical Model for GPUs
3 Machine Learning Techniques
4 Comparison
Methodology
Results
Conclusions and Future Works
Games and Video Cards
1980s - First video driver
3D games evolved, making it necessary to apply textures, lights, shadows, reflections, etc.
This also required more computing power.
As a result, video cards became more flexible and powerful.
Graphic Processing Units - GPUs
The term GPU was popularized by NVIDIA in 1999, when it marketed the GeForce 256 as the first GPU in the world.
In 2002 the first General Purpose GPU was launched. The term GPGPU was created by Mark Harris.
The main manufacturers of GPUs are NVIDIA and AMD. In 2006 NVIDIA announced CUDA.
Deep Learning, Virtual Reality.
General Purpose GPU - GPGPU
The main program executes on the CPU (host), which is responsible for starting the execution on the GPU (device).
GPUs have their own memory hierarchy, and data must be transferred over the PCI Express bus.
GPU Versus CPU
Nowadays GPUs can perform many computing operations far more efficiently than multicore CPUs.
CUDA, GPUs and Memory spaces
A GPU has many processors P. All processors run at the same clock rate R and are grouped into multiprocessors.
A CUDA kernel can be composed of thousands or even millions of threads t.
Type       On Chip  Cacheable  Instructions  Visibility  Latency (g)
Registers  Yes      No         Load/Store    Thread      1 cycle
Shared-L1  Yes      No         Load/Store    Block       5 cycles
Constant   No       Yes        Load          Kernels     100 cycles
Texture    No       Yes        Load/Store    Kernel      100 cycles
Local      No       Yes        Load/Store    Thread      100 cycles
Cache L2   No       Yes        Load/Store    Kernel      250 cycles
Global     No       Yes        Load/Store    Kernel      500 cycles
Table: Memory types in GPUs supported by CUDA
Roadmap of NVIDIA GPU Architectures
In modern GPUs, energy consumption is an important constraint.
GPU designs are generally highly scalable.
Roadmap of NVIDIA GPU Architectures
Compute Capability differentiates the architectures and models of NVIDIA GPUs.
Compute Unified Device Architecture
CUDA is an extension of the C language that allows a program to control the execution of grids on a GPU and to manage its memory.
GPU Programming Model
A GPU application is organized in grids, blocks and threads. Threads are grouped into blocks, and blocks are grouped into a grid.
A linear translation gives the ID of a thread within a grid.
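The linear translation can be sketched as follows for one- and two-dimensional launches. This is plain Python that takes the values of CUDA's built-in `blockIdx`, `blockDim`, `threadIdx` and `gridDim` variables as arguments; the function names are illustrative and do not come from the slides.

```python
def global_thread_id_1d(block_idx, block_dim, thread_idx):
    """Linear id of a thread in a 1D grid of 1D blocks.

    Mirrors the usual CUDA expression blockIdx.x * blockDim.x + threadIdx.x.
    """
    return block_idx * block_dim + thread_idx


def global_thread_id_2d(block_idx, block_dim, thread_idx, grid_dim):
    """Row-major linear id of a thread in a 2D grid of 2D blocks.

    Every argument is an (x, y) tuple, like the x and y fields of a dim3.
    """
    col = block_idx[0] * block_dim[0] + thread_idx[0]
    row = block_idx[1] * block_dim[1] + thread_idx[1]
    width = grid_dim[0] * block_dim[0]  # total number of threads along x
    return row * width + col


if __name__ == "__main__":
    # Thread 5 of block 2, with 256 threads per block.
    print(global_thread_id_1d(2, 256, 5))  # 517
```

The row-major layout in the 2D case is one common choice; other orderings are possible as long as every thread receives a unique ID.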
Top 500 Supercomputers
Intel Core i7 990X: 6 cores, US$ 1000. Theoretical maximum performance: 0.4 TFLOPS
GTX-680: 1500 cores and 2 GB, price US$ 500. Theoretical maximum performance: 3.0 TFLOPS
Accelerators and co-processors in the TOP500 ranking of the most powerful supercomputers in the world
Top 500 Green Supercomputers $$$$$$
Ranking of the most energy-efficient supercomputers in the world.
Amdahl’s law and Flynn’s Taxonomy
Flynn’s Taxonomy - 1966
               Single Instruction   Multiple Instruction
Single Data    SISD - Sequential    MISD
Multiple Data  SIMD [SIMT] - GPU    MIMD - Multicore
Amdahl’s law - 1967
Amdahl’s law gives the theoretical speedup of the execution of a task at fixed
workload that can be expected of a system whose resources are improved.
Speedup, where S = speedup, p = number of processors and T = time:
$$S_p = \frac{T_1}{T_p} \qquad (1)$$
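As a small illustration, Equation (1) can be computed from two measured times, and Amdahl's law can be evaluated in its usual closed form $S(p) = 1 / ((1 - f) + f/p)$. The closed form and the symbol f (the fraction of the work that can be parallelized) are not printed on the slide; they are the standard statement of the law, and the function names below are illustrative.

```python
def measured_speedup(t1, tp):
    """Equation (1): time on one processor divided by time on p processors."""
    return t1 / tp


def amdahl_speedup(parallel_fraction, processors):
    """Theoretical speedup for a fixed workload under Amdahl's law.

    parallel_fraction is the share of the work (between 0 and 1) that
    benefits from the extra processors; the rest stays sequential.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    print(measured_speedup(10.0, 2.5))
    # Even with 1024 processors, a 10% sequential part caps the speedup below 10.
    print(amdahl_speedup(0.9, 1024))
```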
Parallel Random Access Machine (PRAM)
Figure: PRAM Model
The PRAM model ignores lower-level architectural constraints and details such as memory access contention and overhead, synchronization overhead, interconnection network throughput, connectivity, speed limits and link bandwidths.
Bulk Synchronous Parallel Model
Figure: Super-step in the BSP model
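The cost of executing the i-th super-step is given by:
$$w_i + g h_i + L \qquad (2)$$
where $w_i$ is the computation cost, $h_i$ the number of messages sent or received, $g$ the cost of delivering one message and $L$ the cost of the barrier synchronization.
The total execution time of the application is given by:
$$T = W + gH + LS \qquad (3)$$
where $W = \sum_i w_i$, $H = \sum_i h_i$ and $S$ is the number of super-steps.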
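A minimal sketch of Equations (2) and (3), assuming each super-step is described by a pair of values (w_i, h_i). The function names and the numbers in the example are hypothetical; the point is only that summing the per-super-step costs gives the same result as the aggregated form.

```python
def superstep_cost(w, h, g, L):
    """Equation (2): cost of one super-step."""
    return w + g * h + L


def bsp_total_time(supersteps, g, L):
    """Equation (3): T = W + g*H + L*S for a list of (w_i, h_i) pairs."""
    W = sum(w for w, _ in supersteps)  # total computation
    H = sum(h for _, h in supersteps)  # total communication
    S = len(supersteps)                # number of super-steps
    return W + g * H + L * S


if __name__ == "__main__":
    steps = [(100, 10), (50, 20)]  # hypothetical (w_i, h_i) values
    total = bsp_total_time(steps, g=2, L=5)
    per_step = sum(superstep_cost(w, h, g=2, L=5) for w, h in steps)
    print(total, per_step)
```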
Bulk Synchronous Parallel Model
Bulk Synchronous Parallel (BSP) was introduced by Valiant in 1990 (Turing Award 2010).
High-level model for parallelism
Models the computation and communication of a kernel function
We included neither the synchronization step nor communication with host memory
Optimization aspects are modeled by adjusting a single parameter λ
Leslie Valiant
Analytical Model Published
Divergence, communication optimizations and differences between architectures are adjusted by a single parameter, λ [1].
$$T_k = \frac{t \cdot (\mathrm{Comp} + \mathrm{Comm}_{SM} + \mathrm{Comm}_{GM})}{R \cdot P \cdot \lambda} \qquad (4)$$
$$\mathrm{Comm}_{GM} = (ld_1 + st_1 - L1 - L2) \cdot g_{GM} + L1 \cdot g_{L1} + L2 \cdot g_{L2} \qquad (5)$$
$$\mathrm{Comm}_{SM} = (ld_0 + st_0) \cdot g_{SM} \qquad (6)$$
Comp, $ld_0$, $st_0$, $ld_1$ and $st_1$ are obtained from the source code.
L1 and L2 cache hits are captured by profiling.
[1] M. Amaris, D. Cordeiro, A. Goldman, and R. Y. Camargo, "A simple BSP-based model to predict execution time in GPU applications," in 22nd Int'l Conference on HPC, December 2015
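The three equations translate almost directly into code. The sketch below is not the authors' implementation: the function names are illustrative, the kernel counts in the example are hypothetical, and the latencies are simply the cycle counts listed in the memory table on the earlier slide. It also assumes R is given in cycles per second, so that the result is in seconds.

```python
def comm_gm(ld1, st1, l1_hits, l2_hits, g_gm, g_l1, g_l2):
    """Equation (5): global memory cost, with cache hits charged at cache latency."""
    misses = ld1 + st1 - l1_hits - l2_hits
    return misses * g_gm + l1_hits * g_l1 + l2_hits * g_l2


def comm_sm(ld0, st0, g_sm):
    """Equation (6): shared memory cost."""
    return (ld0 + st0) * g_sm


def predicted_time(t, comp, comm_sm_cycles, comm_gm_cycles, R, P, lam):
    """Equation (4): predicted kernel time T_k.

    t   number of threads
    R   clock rate (assumed here to be in cycles per second)
    P   number of processors (cores)
    lam the adjustment parameter lambda
    """
    return t * (comp + comm_sm_cycles + comm_gm_cycles) / (R * P * lam)


if __name__ == "__main__":
    # Hypothetical kernel: 2 global loads and 1 global store per thread, no cache hits.
    gm = comm_gm(ld1=2, st1=1, l1_hits=0, l2_hits=0, g_gm=500, g_l1=5, g_l2=250)
    sm = comm_sm(ld0=0, st0=0, g_sm=5)
    print(predicted_time(t=2**20, comp=24, comm_sm_cycles=sm, comm_gm_cycles=gm,
                         R=1.0e9, P=1536, lam=1.0))
```

In practice λ would be fitted per application from measured times, since it absorbs effects the model does not describe explicitly.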
Machine Learning Techniques
The theoretical subject of “learning” is related to prediction.
Supervised Learning
Unsupervised Learning
3 different Machine Learning Techniques
Simple Linear Regression (LR)
Support Vector Machines (SVM)
Random Forest (RF)
In this work, we deliberately used simple models to show that they achieve reasonable predictions.
Fair comparison: (Data Input - Profile Information).
Linear Regression (LR)
It assumes that there is approximately a linear relationship between each predictor $X_j$ and $Y$. Mathematically, we can write the multiple linear regression model as
$$Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon \qquad (7)$$
where $X_j$ represents the j-th predictor, $\beta_j$ quantifies the association between that variable and the response, and $\epsilon$ is the error term.
Figure: Example of a Linear Regression
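A minimal sketch of fitting Equation (7) by ordinary least squares, assuming NumPy is available. The helper names and the synthetic data are illustrative; the slides do not show which library or fitting routine was used.

```python
import numpy as np


def fit_linear_regression(X, y):
    """Return [beta_0, beta_1, ..., beta_p] minimizing the squared error."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that beta_0 acts as the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta


def predict(beta, X):
    """Evaluate beta_0 + beta_1*X_1 + ... + beta_p*X_p for each row of X."""
    X = np.asarray(X, dtype=float)
    return beta[0] + X @ beta[1:]


if __name__ == "__main__":
    # Synthetic, noise-free data generated from y = 1 + 2*x1 - 3*x2.
    X = [[1, 2], [2, 0], [3, 1], [4, 5], [0, 3]]
    y = [1 + 2 * a - 3 * b for a, b in X]
    beta = fit_linear_regression(X, y)
    print(beta)
    print(predict(beta, [[10, 10]]))
```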
Support Vector Machines (SVM)
SVM belongs to the general category of kernel methods, which are algorithms that depend on the data only through dot products. The dot product can be replaced by a kernel function, which computes a dot product in some possibly high-dimensional feature space Z into which the input vector x is mapped.
Figure: Example of linear and non-linear kernels for SVM in classification
Figure: Example of linear and non-linear kernels for SVM in regression
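The statement that a kernel computes a dot product in the feature space Z can be checked numerically for a simple case. The sketch below uses the homogeneous polynomial kernel of degree 2 on two-dimensional inputs, chosen only because its feature map is easy to write down; the slides do not say which kernels were used in the experiments.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def poly2_kernel(x, y):
    """Polynomial kernel of degree 2: k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2


def poly2_feature_map(x):
    """Explicit map of a 2D input into Z: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


if __name__ == "__main__":
    x, y = (1.0, 2.0), (3.0, -1.0)
    # Both values agree up to rounding: the kernel never builds Z explicitly.
    print(poly2_kernel(x, y))
    print(dot(poly2_feature_map(x), poly2_feature_map(y)))
```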
Random Forest (RF)
Random Forests are ensembles of decision trees, capable of performing both regression and classification tasks.
Figure: Diagram of a decision tree
GPUs of the Testbed
Model         C.C.  Memory  Bus      Bandwidth   L2       Cores/SMs  Clock
GTX-680       3.0   2 GB    256-bit  192.2 GB/s  0.5 MB   1536/8     1058 MHz
Tesla-K40     3.5   12 GB   384-bit  276.5 GB/s  1.5 MB   2880/15    745 MHz
Tesla-K20     3.5   4 GB    320-bit  200 GB/s    1 MB     2496/13    706 MHz
Titan Black   3.5   6 GB    384-bit  336 GB/s    1.5 MB   2880/15    980 MHz
Titan         3.5   6 GB    384-bit  288.4 GB/s  1.5 MB   2688/14    876 MHz
Quadro K5200  3.5   8 GB    256-bit  192.2 GB/s  1 MB     2304/12    771 MHz
Titan X       5.2   12 GB   384-bit  336.5 GB/s  3 MB     3072/24    1076 MHz
GTX-980       5.2   4 GB    256-bit  224.3 GB/s  2 MB     2048/16    1216 MHz
GTX-970       5.2   4 GB    256-bit  224.3 GB/s  1.75 MB  1664/13    1279 MHz
Table: Hardware specifications of the GPUs in the testbed
Algorithm Testbed
9 different applications
Matrix Multiplications in 4 different optimizations:
* Global Memory - MMGU
* Global Memory with coalesced accesses - MMGC
* Global and Shared Memory - MMSU
* Global and shared Memory with coalesced accesses - MMSC
Matrix Addition in 2 different optimizations:
* Global Memory - MAU
* Global Memory with coalesced accesses - MAC
Dot Product - dotP
Vector Addition - vAdd
Maximum Subarray Problem - MSA
Dataset
Each sample was executed 10 times, with a 95% confidence interval.
First Scenario - Machine Learning vs. Machine Learning
MMSC with block sizes $4^2$, $8^2$, $12^2$, $16^2$, $20^2$, $24^2$, $28^2$ and $32^2$. 256 samples per GPU.
More than 2000 samples.
Second Scenario - Analytical Model vs. Machine Learning
Analytical Model
1D applications with input sizes from $2^{18}$ to $2^{27}$. 10 per GPU. 90 samples.
2D applications with input sizes from $2^{8}$ to $2^{13}$. 6 per GPU. 54 samples.
Machine Learning - block sizes $8^2$, $16^2$ and $32^2$.
1D applications with input sizes from $2^{18}$ to $2^{27}$. 207 per GPU. 1863 samples.
2D applications with input sizes from $2^{8}$ to $2^{13}$. 96 per GPU. 864 samples.
MSA with block size 128. 96 samples per GPU. 864 samples.
Features of the Machine Learning Techniques
13 features were used to feed the machine learning techniques.
Feature             Description
num of cores        Number of cores per GPU
max clock rate      GPU maximum clock rate
Bandwidth           Theoretical bandwidth
Input Size          Size of the problem
totalLoadGM         Load transactions in global memory
totalStoreGM        Store transactions in global memory
TotalLoadSM         Load transactions in shared memory
TotalStoreSM        Store transactions in shared memory
FLOPS SP            Single-precision floating-point operations
BlockSize           Number of threads per block
GridSize            Number of blocks in the kernel
No. threads         Number of threads in the application
Achieved Occupancy  Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor
Use Cases of the Analytical Model
Par.
Matrix Multiplication Matrix Addition
vAdd dotP MSA
MMGU MMGC MMSU MMSC MAU MAC
comp N· FMA 1 · 24 1 · 96 (N/t) · 100
ld1 2 · N 2 2 N/t
st1 1 1 1 5
ld0 0 2 · N 0 0 N/t
st0 0 1 0 1 + log(t) 5
Figure: Lambda values of each one of the applications (x-axis: applications MMGU, MMGC, MMSU, MMSC, MAU, MAC, dotP, vAdd and MSA; y-axis: lambda values, from 0 to 130)
Log transformation
We first transformed the data to a $\log_2$ scale and, after performing the learning and predictions, returned to the original scale using a $2^{pred}$ transformation [2], which reduces the effects of non-linearity.
Figure: Quantile-Quantile Analysis of the generated models
[2] B. J. Barnes, et al. "A regression-based approach to scalability prediction," in Proceedings of the 22nd Annual Int'l Conference on Supercomputing, ser. ICS '08. New York, NY, USA.
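The transformation can be sketched with a one-variable fit: both the input size and the time are moved to a log2 scale, a straight line is fitted there, and predictions are brought back with 2 raised to the predicted value. The data below are synthetic (time proportional to the cube of the size) and the function names are illustrative; the actual models in the slides use 13 features rather than one.

```python
import math


def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_log2_model(sizes, times):
    """Fit log2(time) as a linear function of log2(size)."""
    log_sizes = [math.log2(n) for n in sizes]
    log_times = [math.log2(t) for t in times]
    return fit_line(log_sizes, log_times)


def predict_time(model, size):
    """Predict in log2 scale, then return to the original scale."""
    slope, intercept = model
    pred = slope * math.log2(size) + intercept
    return 2.0 ** pred


if __name__ == "__main__":
    sizes = [2 ** k for k in range(8, 14)]   # 2^8 ... 2^13
    times = [1e-9 * n ** 3 for n in sizes]   # synthetic cubic cost
    model = fit_log2_model(sizes, times)
    print(model[0])                          # slope close to 3
    print(predict_time(model, 2 ** 14))
```

A power law such as the cubic cost above becomes a straight line after the log2 transformation, which is why a linear model can fit it.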
Results Machine Learning - 1st Scenario
Figure: Accuracy ($T_k/T_m$) of the machine learning algorithms for matMul-SM-Coalesced (MMSC) with many samples. Panels: Linear Regression, Support Vector Machines and Random Forest. GPUs: Tesla K40, Tesla K20, Quadro, Titan, TitanBlack, TitanX, GTX 680, GTX 980 and GTX 970. The accuracy axis ranges from 0.0 to 2.5.
Results Machine Learning VS Analytical Model
Figure: Accuracy of the compared techniques (Analytical, LM, RF and SVM) for MMGU, MMGC, MMSU, MMSC, MAU, MAC, dotP, vAdd and MSA. The y-axis shows the accuracy $T_k/T_m$, from 0.0 to 2.5. GPUs: Tesla-K40, Tesla-K20, Quadro, Titan, TitanBlack, TitanX, GTX-680, GTX-980 and GTX-970.
Conclusions
Fair comparison.
The analytical model requires calculations
Machine learning provides more flexibility and generalization
Linear Regression can make reasonable predictions
However, ML requires a lot of labeled samples
Future Work
Irregular benchmarks (Rodinia, SHOC).
Multiple kernels or GPUs and global synchronization.
One extra memory level, the CPU RAM.
Feature extraction.
Thanks for your attention
Repository of the work:
https://github.com/marcosamaris/svm-gpuperf
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 

A Comparison of GPU Execution Time Prediction using Machine Learning and Analytical Modeling

  • 1. Sections: Introduction and Motivation; Parallel Programming Models; Machine Learning Techniques; Comparison. A Comparison of GPU Execution Time Prediction using Machine Learning and Analytical Modeling. Ph.D.(c) CS Marcos Amarís González. Advisor: Dr. Alfredo Goldman vel Lejbman. Co-advisor: Dr. Raphael Yokoingawa de Camargo. December, 2016. (gold, amaris)@ime.usp.br (IME - USP). Running title: BSP-based model Vs. Machine Learning.
  • 2. Timeline: 1 Introduction and Motivation; 2 Parallel Programming Models (BSP-based Analytical Model for GPUs); 3 Machine Learning Techniques; 4 Comparison (Methodology; Results; Conclusions and Future Works).
  • 3. BSP-based model Vs. Machine Learning: 1 Introduction and Motivation; 2 Parallel Programming Models; 3 Machine Learning Techniques; 4 Comparison.
  • 4. Games and Video Cards. 1980s: first video driver. As 3D games evolved, it became necessary to apply textures, lights, shadows, reflections, etc., which also required more computing power. As a result, video cards became more flexible and powerful.
  • 5. Graphic Processing Units - GPUs. The term GPU was popularized by NVIDIA in 1999, when it marketed the GeForce 256 as the first GPU in the world. In 2002 the first General Purpose GPU was launched. The term GPGPU was coined by Mark Harris. The main manufacturers of GPUs are NVIDIA and AMD. In 2006 NVIDIA introduced CUDA. Current drivers: Deep Learning, Virtual Reality.
  • 6. General Purpose GPU - GPGPU. The main program executes on the CPU (host) and is responsible for starting the execution on the GPU (device). GPUs have their own memory hierarchy, and data must be transferred over PCI Express.
  • 7. GPU Versus CPU. Nowadays GPUs can perform many computing operations much more efficiently than multicore CPUs.
  • 8. CUDA, GPUs and Memory spaces. A GPU has many processors P; all processors have the same clock rate R and are grouped into multiprocessors. A CUDA kernel can be composed of thousands or even millions of threads t.
      Table: Memory types in GPUs supported by CUDA
      Type       On Chip  Cacheable  Instructions  Visibility  Latency (g)
      Registers  Yes      No         Load/Store    Thread      1 cycle
      Shared-L1  Yes      No         Load/Store    Block       5 cycles
      Constant   No       Yes        Load          Kernel      100 cycles
      Texture    No       Yes        Load/Store    Kernel      100 cycles
      Local      No       Yes        Load/Store    Thread      100 cycles
      Cache L2   No       Yes        Load/Store    Kernel      250 cycles
      Global     No       Yes        Load/Store    Kernel      500 cycles
  • 9. Roadmap of NVIDIA GPU architectures. In modern GPUs, energy consumption is an important constraint. GPU designs are generally highly scalable.
  • 10. Roadmap of NVIDIA GPU architectures. Compute Capability differentiates the architectures and models of NVIDIA GPUs.
  • 11. CUDA - Compute Unified Device Architecture. CUDA is an extension of the C language; it allows controlling the execution of grids on a GPU and managing its memory.
  • 12. GPU Programming Model. A GPU application is organized in grids, blocks and threads: threads are grouped into blocks, and blocks are grouped into a grid. A linear translation gives the ID of a thread within a grid.
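The linear translation mentioned on slide 12 can be illustrated with a small Python sketch. It mirrors the usual CUDA expression `blockIdx.x * blockDim.x + threadIdx.x`; the function names and the row-major 2D layout are illustrative assumptions, not taken from the slides.

```python
def global_thread_id_1d(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """Linear ID of a thread in a 1D grid of 1D blocks."""
    return block_idx * block_dim + thread_idx


def global_thread_id_2d(block_idx, block_dim, thread_idx, grid_dim_x):
    """Row-major linear ID of a thread in a 2D grid of 2D blocks.

    block_idx, block_dim and thread_idx are (x, y) tuples;
    grid_dim_x is the number of blocks along x (assumed layout).
    """
    col = block_idx[0] * block_dim[0] + thread_idx[0]
    row = block_idx[1] * block_dim[1] + thread_idx[1]
    width = grid_dim_x * block_dim[0]  # threads per row of the whole grid
    return row * width + col


# Thread 5 of block 2, with 256 threads per block.
example_1d = global_thread_id_1d(2, 256, 5)
# Thread (3, 4) of block (1, 2), 16x16 blocks, 4 blocks along x.
example_2d = global_thread_id_2d((1, 2), (16, 16), (3, 4), 4)
```

Each thread typically uses this ID to select the element of the input it works on.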
  • 13. Top 500 Supercomputers. Intel Core i7 990X: 6 cores, US$ 1000, theoretical maximum performance 0.4 TFLOPS. GTX680: 1500 cores and 2 GB, price US$ 500, theoretical maximum performance 3.0 TFLOPS. Accelerators and co-processors appear in the Top 500 ranking of the most powerful supercomputers in the world.
  • 14. Top 500 Green Supercomputers. Ranking of the most energy-efficient supercomputers in the world.
  • 15. BSP-based model Vs. Machine Learning: 1 Introduction and Motivation; 2 Parallel Programming Models; 3 Machine Learning Techniques; 4 Comparison.
  • 16. Amdahl's law and Flynn's Taxonomy.
      Flynn's Taxonomy - 1966
                      Single Instruction   Multiple Instruction
      Single Data     SISD - Sequential    MISD
      Multiple Data   SIMD [SIMT] - GPU    MIMD - Multicore
      Amdahl's law - 1967. Amdahl's law gives the theoretical speedup in the execution of a task at fixed workload that can be expected of a system whose resources are improved.
      Speedup, with S = speedup, P = number of processors, T = time:
      $S_P = \frac{T_1}{T_P}$ (1)
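As a worked example of the speedup in equation (1), the sketch below computes the measured speedup and, separately, the standard Amdahl bound $1 / ((1 - f) + f / P)$ for a parallelizable fraction $f$. The bound is the textbook form of Amdahl's law and is not written out on the slide; the function names and numbers are illustrative.

```python
def speedup(t1: float, tp: float) -> float:
    """Equation (1): S_P = T_1 / T_P."""
    return t1 / tp


def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Textbook Amdahl bound for a fixed workload.

    parallel_fraction is the share of the run time that can be
    parallelized; the remaining share stays sequential.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# A task taking 10 s sequentially and 2.5 s in parallel.
measured = speedup(10.0, 2.5)

# With 90% parallel work the speedup stays below 1 / 0.1 = 10,
# no matter how many processors are added.
for p in (10, 100, 1000):
    print(p, round(amdahl_speedup(0.9, p), 3))
```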
  • 17. Parallel Random Access Machine (PRAM). Figure: PRAM Model. It ignores lower-level architectural constraints and details, such as memory access contention and overhead, synchronization overhead, interconnection network throughput, connectivity, speed limits and link bandwidths.
  • 18. Bulk Synchronous Parallel Model. Figure: Super-step in the BSP model. The cost to execute the i-th super-step is given by
      $w_i + g h_i + L$ (2)
      where $w_i$ is the computation cost and $h_i$ the communication volume of the super-step, $g$ the communication cost per unit, and $L$ the synchronization cost. The total execution time of the application is given by
      $T = W + gH + LS$ (3)
      where $W = \sum_i w_i$, $H = \sum_i h_i$ and $S$ is the number of super-steps.
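The two BSP cost formulas on slide 18 can be checked against each other with a short sketch: summing equation (2) over all super-steps gives the same value as equation (3). The numeric values below are made up for illustration.

```python
def superstep_cost(w: float, h: float, g: float, L: float) -> float:
    """Equation (2): cost of one super-step, w_i + g * h_i + L."""
    return w + g * h + L


def total_cost(supersteps, g: float, L: float) -> float:
    """Equation (3): T = W + g * H + L * S.

    supersteps is a list of (w_i, h_i) pairs.
    """
    W = sum(w for w, _ in supersteps)
    H = sum(h for _, h in supersteps)
    S = len(supersteps)
    return W + g * H + L * S


# Two hypothetical super-steps: (computation, communication volume).
steps = [(100.0, 10.0), (50.0, 20.0)]
g, L = 2.0, 5.0

per_step_sum = sum(superstep_cost(w, h, g, L) for w, h in steps)
closed_form = total_cost(steps, g, L)
```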
  • 19. Bulk Synchronous Parallel Model. Bulk Synchronous Parallel (BSP) was introduced by Leslie Valiant in 1990 (Turing Award 2010). It is a high-level model for parallelism. We model the computation and communication of a kernel function. We did not include the synchronization step or the communication with host memory. Optimization aspects are modeled by adjusting a single parameter λ. Photo: Leslie Valiant.
  • 20. Analytical Model Published. Divergence, optimizations in the communication and differences between architectures are adjusted by one parameter, λ [1].
      $T_k = \frac{t \cdot (Comp + Comm_{SM} + Comm_{GM})}{R \cdot P \cdot \lambda}$ (4)
      $Comm_{GM} = (ld_1 + st_1 - L1 - L2) \cdot g_{GM} + L1 \cdot g_{L1} + L2 \cdot g_{L2}$ (5)
      $Comm_{SM} = (ld_0 + st_0) \cdot g_{SM}$ (6)
      $Comp$, $ld_0$, $st_0$, $ld_1$ and $st_1$ are obtained from the source code. L1 and L2 cache hits are captured by profiling.
      [1] M. Amaris, D. Cordeiro, A. Goldman, and R. Y. Camargo, "A simple BSP-based model to predict execution time in GPU applications," in 22nd Int'l Conference on HPC, December 2015.
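Equations (4) to (6) on slide 20 translate directly into code. The sketch below is an illustration, not the authors' implementation: the default latencies reuse the cycle counts from the memory table on slide 8 (assuming L1 shares the shared-memory latency), the clock rate is assumed to be in Hz so the result is in seconds, and all example inputs are hypothetical.

```python
def predict_kernel_time(t, comp, ld0, st0, ld1, st1, l1_hits, l2_hits,
                        clock_rate, processors, lam,
                        g_gm=500.0, g_l1=5.0, g_l2=250.0, g_sm=5.0):
    """Predicted kernel time T_k following equations (4)-(6).

    t          : number of threads
    comp       : computation cycles per thread
    ld0, st0   : shared-memory loads / stores per thread
    ld1, st1   : global-memory loads / stores per thread
    l1_hits, l2_hits : cache hits (obtained by profiling)
    clock_rate : R, assumed here to be in Hz
    processors : P, number of cores
    lam        : the adjustment parameter lambda
    """
    # Equation (5): global-memory accesses, split by where they are served.
    comm_gm = ((ld1 + st1 - l1_hits - l2_hits) * g_gm
               + l1_hits * g_l1 + l2_hits * g_l2)
    # Equation (6): shared-memory accesses.
    comm_sm = (ld0 + st0) * g_sm
    # Equation (4).
    return t * (comp + comm_sm + comm_gm) / (clock_rate * processors * lam)


# Hypothetical kernel: 1024 threads, 2 global loads and 1 store per thread,
# one of the accesses served by the L2 cache.
t_k = predict_kernel_time(t=1024, comp=10, ld0=0, st0=0, ld1=2, st1=1,
                          l1_hits=0, l2_hits=1,
                          clock_rate=1e9, processors=1024, lam=1.0)
```

Because λ appears in the denominator, a larger λ shortens the predicted time, which is how one parameter can absorb optimization effects.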
  • 21. BSP-based model Vs. Machine Learning: 1 Introduction and Motivation; 2 Parallel Programming Models; 3 Machine Learning Techniques; 4 Comparison.
  • 22. Machine Learning Techniques. The theoretical subject of "learning" is related to prediction. Two families: Supervised Learning and Unsupervised Learning. 3 different Machine Learning techniques: Simple Linear Regression (LR), Support Vector Machines (SVM), Random Forest (RF). In this work, we wanted to use simple models to show that they achieve reasonable predictions. Fair comparison: (Data Input - Profile Information).
  • 23. Linear Regression (LR). It assumes that there is an approximately linear relationship between each $X_p$ and $Y$. Mathematically, we can write the multiple linear regression model as
      $Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon$ (7)
      where $X_p$ represents the p-th predictor and $\beta_p$ quantifies the association between that variable and the response. Figure: Example of a Linear Regression.
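To make equation (7) concrete, here is a minimal ordinary least squares sketch for the single-predictor case ($p = 1$), written in plain Python. It is an illustration of the technique only; the slides do not say which implementation was used, and the data points are invented.

```python
def fit_simple_lr(xs, ys):
    """Least-squares estimates (beta0, beta1) for y ~ beta0 + beta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


def predict(beta0, beta1, x):
    """Prediction of the fitted line at x."""
    return beta0 + beta1 * x


# Invented points lying exactly on y = 1 + 2x.
beta0, beta1 = fit_simple_lr([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```

With several predictors the same idea applies, but the coefficients come from solving a linear system rather than from these two closed-form expressions.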
  • 24. Support Vector Machines (SVM). SVM belongs to the general category of kernel methods, which are algorithms that depend on the data only through dot products. The dot product can be replaced by a kernel function, which computes a dot product in some possibly high-dimensional feature space Z. It maps the input vector x into the feature space Z. Figure: Example of linear and non-linear kernels for SVM in classification.
  • 25. Support Vector Machines (SVM). Same text as slide 24. Figure: Example of linear and non-linear kernels for SVM in regression.
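The statement on slides 24 and 25 that a kernel function computes a dot product in a feature space Z can be verified numerically. The sketch below uses the degree-2 polynomial kernel $k(x, y) = (x \cdot y)^2$ on 2D inputs, whose explicit feature map is $\phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)$. This kernel is chosen only as a standard textbook example; the slides do not state which kernel was used.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def poly2_kernel(x, y):
    """Degree-2 polynomial kernel, evaluated in the input space."""
    return dot(x, y) ** 2


def phi(x):
    """Explicit feature map into Z for the kernel above (2D inputs only)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x, y = (1.0, 2.0), (3.0, 4.0)
in_input_space = poly2_kernel(x, y)       # never builds phi
in_feature_space = dot(phi(x), phi(y))    # explicit mapping, same value
```

The kernel version needs only a 2D dot product, which is why kernel methods can work in high-dimensional feature spaces without constructing them.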
  • 26. Random Forest (RF). Random Forests belong to the decision tree methods and can perform both regression and classification tasks. Figure: Diagram of a decision tree.
  • 27. BSP-based model Vs. Machine Learning: 1 Introduction and Motivation; 2 Parallel Programming Models; 3 Machine Learning Techniques; 4 Comparison.
  • 28. GPUs of the Testbed.
      Table: Hardware specifications of the GPUs in the testbed
      Model         C.C.  Memory  Bus      Bandwidth   L2       Cores/SM  Clock
      GTX-680       3.0   2 GB    256-bit  192.2 GB/s  0.5 MB   1536/8    1058 MHz
      Tesla-K40     3.5   12 GB   384-bit  276.5 GB/s  1.5 MB   2880/15   745 MHz
      Tesla-K20     3.5   4 GB    320-bit  200 GB/s    1 MB     2496/31   706 MHz
      Titan Black   3.5   6 GB    384-bit  336 GB/s    1.5 MB   2880/15   980 MHz
      Titan         3.5   6 GB    384-bit  288.4 GB/s  1.5 MB   2688/14   876 MHz
      Quadro K5200  3.5   8 GB    256-bit  192.2 GB/s  1 MB     2304/12   771 MHz
      Titan X       5.2   12 GB   384-bit  336.5 GB/s  3 MB     3072/24   1076 MHz
      GTX-980       5.2   4 GB    256-bit  224.3 GB/s  2 MB     2048/16   1216 MHz
      GTX-970       5.2   4 GB    256-bit  224.3 GB/s  1.75 MB  1664/13   1279 MHz
  • 29. Algorithm Testbed. 9 different applications.
      Matrix Multiplication in 4 different optimizations:
        * Global Memory - MMGU
        * Global Memory with coalesced accesses - MMGC
        * Global and Shared Memory - MMSU
        * Global and Shared Memory with coalesced accesses - MMSC
      Matrix Addition in 2 different optimizations:
        * Global Memory - MAU
        * Global Memory with coalesced accesses - MAC
      Dot Product - dotP
      Vector Addition - vAdd
      Maximum Subarray Problem - MSA
  • 30. Dataset. Each sample was run 10 times, with a confidence interval of 95%.
      First Scenario - Machine Learning Vs Machine Learning
        * MMSC with block sizes $4^2$, $8^2$, $12^2$, $16^2$, $20^2$, $24^2$, $28^2$ and $32^2$. 256 samples per GPU. More than 2000 samples.
      Second Scenario - Analytical Model Vs Machine Learning
        Analytical Model:
        * 1D applications with input sizes from $2^{18}$ to $2^{27}$. 10 per GPU. 90 samples.
        * 2D applications with input sizes from $2^{8}$ to $2^{13}$. 6 per GPU. 54 samples.
        Machine Learning - block sizes $8^2$, $16^2$ and $32^2$:
        * 1D applications with input sizes from $2^{18}$ to $2^{27}$. 207 per GPU. 1863 samples.
        * 2D applications with input sizes from $2^{8}$ to $2^{13}$. 96 per GPU. 864 samples.
        * MSA with block size 128. 96 samples per GPU. 864 samples.
  • 31. Features of the Machine Learning Techniques. 13 features were used to feed the machine learning techniques.
      Feature             Description
      num of cores        Number of cores per GPU
      max clock rate      GPU max clock rate
      Bandwidth           Theoretical bandwidth
      Input Size          Size of the problem
      totalLoadGM         Load transactions in Global Memory
      totalStoreGM        Store transactions in Global Memory
      TotalLoadSM         Load transactions in Shared Memory
      TotalStoreSM        Store transactions in Shared Memory
      FLOPS SP            Floating-point operations in single precision
      BlockSize           Number of threads per block
      GridSize            Number of blocks in the kernel
      No. threads         Number of threads in the application
      Achieved Occupancy  Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor
  • 32. Use Cases of the Analytical Model.
      Table: model parameters per application. Columns: Matrix Multiplication (MMGU, MMGC, MMSU, MMSC), Matrix Addition (MAU, MAC), vAdd, dotP, MSA. Values are listed in column order as they appear in the source.
      Par.   Values
      comp   N · FMA | 1 · 24 | 1 · 96 | (N/t) · 100
      ld1    2 · N | 2 | 2 | N/t
      st1    1 | 1 | 1 | 5
      ld0    0 | 2 · N | 0 | 0 | N/t
      st0    0 | 1 | 0 | 1 + log(t) | 5
      Figure: Lambda values of each one of the applications (x-axis: Applications, MMGU to MSA; y-axis: Lambda Values, 0 to 130).
  • 33. Log transformation. We first transformed the data to a $\log_2$ scale and, after performing the learning and predictions, we returned to the original scale using a $2^{pred}$ transformation [2], reducing the non-linearity effects. Figure: Quantile-Quantile analysis of the generated models.
      [2] B. J. Barnes, et al., "A regression-based approach to scalability prediction," in Proceedings of the 22nd Annual Int'l Conference on Supercomputing, ser. ICS '08. New York, NY, USA.
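The log transformation described on slide 33 can be sketched as follows: take $\log_2$ of the inputs and of the measured times, fit a model in that space, and map each prediction back with $2^{pred}$. The sketch uses a one-predictor least-squares line as a stand-in for the learners in the slides, and the timing data are invented.

```python
import math


def fit_line(xs, ys):
    """Least-squares (intercept, slope) for y ~ intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_log2(sizes, times):
    """Fit in log2 space: log2(time) ~ b0 + b1 * log2(size)."""
    return fit_line([math.log2(s) for s in sizes],
                    [math.log2(t) for t in times])


def predict_time(model, size):
    """Predict in log2 space, then return to the original scale via 2**pred."""
    b0, b1 = model
    pred = b0 + b1 * math.log2(size)
    return 2.0 ** pred


# Invented measurements: the time doubles whenever the input size doubles.
sizes = [2 ** 18, 2 ** 19, 2 ** 20]
times = [1.0, 2.0, 4.0]
model = fit_log2(sizes, times)
estimate = predict_time(model, 2 ** 21)
```

A power-law relation between input size and time becomes a straight line in log-log space, which is why the transformation reduces non-linearity for the fitted model.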
  • 34. Results Machine Learning - 1st Scenario. Figure: Accuracy of Machine Learning Algorithms of matMul-SM-Coalesced with many samples. Panels: Linear Regression of MMSC, Support Vector Machines of MMSC, Random Forest of MMSC. y-axis: Accuracy Tk/Tm (0.0 to 2.5). GPUs: Tesla K40, Tesla K20, Quadro, Titan, TitanBlack, TitanX, GTX 680, GTX 980, GTX 970.
  • 35. Results Machine Learning VS Analytical Model. Figure: Accuracy of the compared techniques. Panels: Analytical, LM, RF, SVM. Applications shown: MMGU, MMGC, MMSU, MMSC, MAU, MAC, dotP, vAdd, MSA. y-axis: Accuracy Tk/Tm (0.0 to 2.5). GPUs: Tesla-K40, Tesla-K20, Quadro, Titan, TitanBlack, TitanX, GTX-680, GTX-980, GTX-970.
  • 36. Results Machine Learning VS Analytical Model. Figure: Accuracy of the compared techniques (second build of the figure on slide 35, with the same panels, applications, y-axis and GPUs).
  • 37. Conclusions. Fair comparison. The analytical model requires calculations. Machine learning provides more flexibility and generalization. Linear Regression can make reasonable predictions. However, ML requires a large number of labeled samples.
  • 38. Future Works. Irregular benchmarks (Rodinia, SHOC). Multiple kernels or GPUs, and global synchronization. One extra memory level, the CPU RAM. Feature extraction.
  • 39. Thanks for your attention. Repository of the work: https://github.com/marcosamaris/svm-gpuperf