Scolari's ICCD17 Talk
1. Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it
ICCD, 7 November 2017
Alberto Scolari, Yunseong Lee, Markus Weimer, Matteo Interlandi
Towards Accelerating Generic Machine Learning Prediction Pipelines
2. Accelerating ML prediction
• Machine Learning models are composed as Directed Acyclic Graphs (DAGs) of transformations in all major toolkits: TensorFlow, Scikit, Microsoft Internal ML Tool (IMLT)
• How to accelerate generic prediction DAGs?
– I.e. how to operationalize models?
• We need a systematic approach to accelerating prediction DAGs
• DAGs can be split into stages, and optimized per stage
– Separation of model representation from execution
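Since the talk's central object is a DAG of transformations, the following is a minimal sketch of how such a pipeline can be expressed in C# (the language the case-study model is written in). The interface and class names are illustrative assumptions, not IMLT's (or any toolkit's) actual API:

```csharp
// Minimal sketch of a prediction pipeline as composable transformations.
// ITransform and Chain are hypothetical names for illustration only.
using System;

interface ITransform<TIn, TOut>
{
    TOut Apply(TIn input);
}

// A linear chain is the simplest DAG; real pipelines may branch, e.g.
// char and word N-grams computed from the same tokens and concatenated.
sealed class Chain<TIn, TMid, TOut> : ITransform<TIn, TOut>
{
    private readonly ITransform<TIn, TMid> _first;
    private readonly ITransform<TMid, TOut> _second;

    public Chain(ITransform<TIn, TMid> first, ITransform<TMid, TOut> second)
    {
        _first = first;
        _second = second;
    }

    // Prediction is a single pass through the composed transformations.
    public TOut Apply(TIn input) => _second.Apply(_first.Apply(input));
}
```

Separating this representation from its execution is what allows the same DAG to be split into stages and scheduled on different devices.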
3. Related work
• Related work focuses on single operators
• Neural Networks are the most common target
– they have great predictive capabilities
– they are usually the slowest operators by far
• Relevant works:
– Microsoft Brainwave [1]
– Google TPU [2]
– Qualcomm NPE [3]
– Decision Trees and Random Forests on FPGA [4]
[1] https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/
[2] In-Datacenter Performance Analysis of a Tensor Processing Unit, arXiv, Apr 2017
[3] https://developer.qualcomm.com/software/snapdragon-neural-processing-engine
[4] Owaida, Muhsen, et al. "Scalable inference of decision tree ensembles: Flexible design for CPU-FPGA platforms." Field Programmable Logic and Applications (FPL), 2017
4. A case study pipeline
• No existing work covers entire DAGs
• We used production-like models
– E.g. a sentiment-analysis DAG based on linear regression
[Fig. 1. A model pipeline for sentiment analysis: Input → Tokenizer → {Char Ngram, Word Ngram} → Concat → Linear Regression → Output]
[Fig. 2. Execution breakdown of the example model (% of execution time per operator)]
Which operator should we accelerate?
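A breakdown like Fig. 2 can be reproduced with straightforward per-operator timing. Below is a hedged sketch of such an instrumentation helper in C#; nothing here is the paper's actual measurement code:

```csharp
// Sketch: time one operator over a batch of inputs, to build an
// execution-time breakdown like Fig. 2. Illustrative only.
using System;
using System.Diagnostics;

static class OperatorTimer
{
    public static double MeasureMs<TIn, TOut>(Func<TIn, TOut> op, TIn[] batch)
    {
        var sw = Stopwatch.StartNew();
        foreach (var x in batch) op(x);   // run the operator in isolation
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }
}
```

Dividing each operator's time by the total gives the per-operator percentages plotted in Fig. 2.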
5. Case study investigation
• Our DAG is written in IMLT, in C#
• IMLT initializes the DAG state lazily: the first run is 540x slower than subsequent ones
– The model must be kept in memory for acceptable performance
• This is not always feasible, so we consider two running scenarios:
– cold: the model is not initialized in memory
– hot: the model is completely initialized in memory
6. Our approach: stages
• Looking at IMLT, we saw large room for optimization:
– Operator fusion
– Buffer sharing
– Low-level optimizations: inlining, vectorization, …
• Many of these optimizations are better achieved by grouping operators into stages
– A stage contains one or more operators
– It is the basic unit of scheduling and execution
– The concept is borrowed from the DB world
– Different stages can run on different devices (see the sketch below)
[Figure: the pipeline's stages mapped onto devices (CPU 0, CPU 1, CPU 2, FPGA)]
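A hedged sketch of what a stage might look like as a data structure: a group of operators scheduled as one unit, sharing buffers, pinned to a target device. The names below are assumptions for illustration, not the paper's implementation:

```csharp
// Sketch: a stage groups one or more operators into the basic unit of
// scheduling and execution, pinned to a device. Illustrative names only.
using System;
using System.Collections.Generic;

enum Device { Cpu0, Cpu1, Cpu2, Fpga }

// Buffers are allocated once per stage and shared by its operators,
// avoiding per-operator intermediate allocations (buffer sharing).
sealed class StageBuffers
{
    public readonly float[] Scratch = new float[1 << 16];
}

interface IOperator
{
    void Run(StageBuffers buffers);
}

sealed class Stage
{
    public Device Target { get; }
    private readonly List<IOperator> _operators;

    public Stage(Device target, List<IOperator> operators)
    {
        Target = target;
        _operators = operators;
    }

    // One scheduling decision covers the whole group of fused operators.
    public void Execute(StageBuffers shared)
    {
        foreach (var op in _operators) op.Run(shared);
    }
}
```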
7. Model reimplementation
• We reimplemented the model in three stages (sketched in code after the figure below):
1. Tokenization and character N-gram
• Tokenization and character counting happen in the same loop
2. Word N-gram: a dictionary lookup of each word, whose result is added to the array of character N-gram counts
3. Linear regression classification: a sparse vector dot product between the N-gram counts and the model weights
[Figure: the reimplemented pipeline. Stage 1: T (Tokenizer) + CN (Char Ngram); Stage 2: WN (Word Ngram); Stage 3: LinReg]
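To make the three stages concrete, here is a compact C# sketch of the reimplemented pipeline. The N-gram size (trigrams), the hashing scheme, and the dictionary layout are assumptions for illustration; the paper's actual implementation may differ:

```csharp
// Sketch of the three stages described above. Trigram size, hashing,
// and dictionary layout are illustrative assumptions.
using System;
using System.Collections.Generic;

sealed class SentimentModel
{
    private readonly Dictionary<string, int> _wordNgramIds; // word -> feature id
    private readonly Dictionary<int, float> _weights;       // sparse LR weights
    private const int CharFeatures = 1 << 16;               // hashed char trigrams

    public SentimentModel(Dictionary<string, int> wordNgramIds,
                          Dictionary<int, float> weights)
    {
        _wordNgramIds = wordNgramIds;
        _weights = weights;
    }

    public float Predict(string text)
    {
        var counts = new Dictionary<int, float>(); // sparse feature vector
        var words = new List<string>();

        // Stage 1: tokenization and character trigram counting fused
        // into one pass over the input, instead of two operators.
        int wordStart = 0;
        for (int i = 0; i <= text.Length; i++)
        {
            bool boundary = i == text.Length || char.IsWhiteSpace(text[i]);
            if (boundary)
            {
                if (i > wordStart) words.Add(text.Substring(wordStart, i - wordStart));
                wordStart = i + 1;
            }
            if (i + 3 <= text.Length)
            {
                // Hash the trigram into the fixed char-feature range.
                int h = (text[i] * 31 + text[i + 1]) * 31 + text[i + 2];
                int id = (h & 0x7fffffff) % CharFeatures;
                counts.TryGetValue(id, out float c); counts[id] = c + 1;
            }
        }

        // Stage 2: word N-gram lookup; hits are added to the same sparse
        // vector, offset past the char-feature range.
        foreach (var w in words)
            if (_wordNgramIds.TryGetValue(w, out int id))
            {
                int k = CharFeatures + id;
                counts.TryGetValue(k, out float c); counts[k] = c + 1;
            }

        // Stage 3: linear regression as a sparse dot product between the
        // feature counts and the model weights.
        float score = 0f;
        foreach (var kv in counts)
            if (_weights.TryGetValue(kv.Key, out float w))
                score += w * kv.Value;
        return score;
    }
}
```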
8. Experimental results
• CPU implementation running on an Intel Xeon CPU E5-2620
– Still C#, but without complexities like reflection, virtual calls, …
– Upfront buffer allocation and sharing within each stage (sketched below)
• FPGA implementation running on an ADM KU3 board with a Xilinx Kintex UltraScale FPGA
– Implemented with Xilinx Vivado HLS and SDAccel
– Heavily limited by RAM accesses: no out-of-order execution and no multiple outstanding memory requests
[Fig. 4. Performance improvement achieved by the CPU and FPGA implementations]
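The "upfront buffer allocation and sharing" point is worth a small sketch: intermediate buffers are allocated once when the stage is built and reused on every prediction, so the hot path allocates nothing. Field names and sizes below are illustrative assumptions:

```csharp
// Sketch: buffers allocated once, up front, and reused across
// predictions; the steady-state path does no allocation and puts no
// pressure on the garbage collector. Sizes are illustrative.
using System;

sealed class Stage1Buffers
{
    public readonly int[] TokenOffsets = new int[4096];
    public readonly float[] NgramCounts = new float[1 << 16];

    // Clearing and reusing is cheaper than reallocating per prediction.
    public void Reset() => Array.Clear(NgramCounts, 0, NgramCounts.Length);
}
```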
9. Achievements and future work
• We have devised a methodology to accelerate prediction DAGs and better operationalize them
– It considers multiple hardware targets
• Open problems:
– How can stages be identified automatically?
– How can they be optimized automatically?
• Future work:
– A library of ML operators for FPGA
– Sharing operator state in addition to computation
Towards Accelerating Generic Machine Learning Prediction Pipelines
Alberto Scolari, Yunseong Lee, Markus Weimer, Matteo Interlandi
Speaker: Alberto Scolari - alberto.scolari@polimi.it