Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it
Oracle Labs, Belmont, 05/30/2018
Yunseong Lee, Alberto Scolari, Matteo Interlandi, Markus
Weimer, Marco D. Santambrogio, Byung-Gon Chun
PRETZEL: Opening the Black Box
of Machine Learning Prediction Serving Systems
ML-as-a-Service
• ML models are deployed on cloud platforms as black
boxes
• Users deploy multiple models per machine (tens to hundreds)
2
• Deployed models are often similar
- Similar structure
- Similar state
• Two key requirements:
1. Performance: latency or throughput
2. Model density: number of models per
machine
(Diagram: two deployed models, each scoring "This is a good product" with its own copy of the dictionary {good: 0, bad: 1, …})
We need to know structure and state:
white-box model
3
Breaking the black-box model
1. To generate an optimised version of a model on
deployment: higher performance
2. To allocate shared state only once and share it among
models: higher density
‣ Opening the black box
‣ Pretzel, white-box serving system
‣ Evaluation
‣ Conclusions and future work
4
Outline
OPENING THE BLACK BOX
5
6
Case study
• 300 models in ML.NET [1], C#, from Amazon Reviews dataset
[2]: 250 Sentiment Analysis + 50 Regression Task
• Analyzed and accelerated in previous work [3]
• Most of the time is spent in featurization
• No single bottleneck
PRETZEL: Opening the Black Box of Machine Learning
Prediction Serving Systems
Paper # 355
Abstract
Machine Learning models are often composed of
pipelines of transformations. While this design allows to
efficiently execute single model components at training-
time, prediction serving has different requirements such
as low latency, high throughput and graceful performance
degradation under heavy load. Current prediction serv-
ing systems consider models as black boxes, whereby
prediction-time-specific optimizations are ignored in fa-
vor of ease of deployment. In this paper, we present
PRETZEL, a prediction serving system introducing a
novel white box architecture enabling both end-to-end
and multi-model optimizations. Using production-like
model pipelines, our experiments show that PRETZEL is
able to introduce performance improvements over differ-
ent dimensions; compared to state-of-the-art approaches
PRETZEL is able to reduce latency by up to 100× while
reducing memory footprint by up to 7×, and increasing
throughput by up to 2.5×.
1 Introduction
Many Machine Learning (ML) frameworks such as
Google TensorFlow [2], Facebook Caffe2 [4], Scikit-
learn [41], or CompanyX’s Internal ML Library (ML2)
allow data scientists to declaratively author pipelines of
transformations to train models from large-scale input
datasets. Model pipelines are internally represented as
Directed Acyclic Graphs (DAGs) of operators comprising
data transformations and featurizers (e.g., string tokeniza-
tion, hashing, etc.), and ML models (e.g., decision trees,
linear models, SVMs, etc.). Figure 1 shows an example
pipeline for text analysis whereby input sentences are
classified according to the expressed sentiment.
Machine Learning is usually conceptualized as a two-step
process: first, during training model parameters are
estimated from large datasets by running computation-
ally intensive iterative algorithms; successively, trained
pipelines are used for inference to generate predictions
through the estimated model parameters. When trained
(Figure 1 diagram: “This is a nice product” → Tokenizer → Char Ngram / Word Ngram → Concat → Linear Regression → Positive vs. Negative)
Figure 1: A sentiment analysis pipeline consisting of
operators for featurization (ellipses), followed by a ML
model (diamond). Tokenizer extracts tokens (e.g., words)
from the input string. Char and Word Ngrams featurize
input tokens by extracting n-grams. Concat generates a
unique feature vector which is then scored by a Linear
Regression predictor. This is a simplification: the actual
DAG contains about 12 operators.
pipelines are served for inference, the full set of operators
is deployed altogether. However, pipelines have different
system characteristics based on the phase in which they
are employed: for instance, at training time ML models
run complex algorithms to scale over large datasets (e.g.,
linear models can use gradient descent in one of its many
flavors [43, 42, 44]), while, once trained, they behave as
other regular featurizers and data transformations; further-
more, during inference pipelines are often surfaced for
direct users’ servicing and therefore require low latency,
high throughput, and graceful degradation of performance
in case of load spikes.
Existing prediction serving systems, such as Clip-
per [7, 25], TensorFlow Serving [3, 39], Rafiki [49], ML2
itself, and others [13, 14, 36, 12] focus mainly on ease
of deployment, where pipelines are considered as black
boxes and deployed into containers (e.g., Docker [9] in
Clipper and Rafiki, servables in TensorFlow Serving).
Under this strategy, only “pipeline-agnostic” optimiza-
tions such as caching, batching and buffering are avail-
able. Nevertheless, we found that black box approaches
fell short on several aspects.
Fig. 2. Execution breakdown of the example model
2 THE CASE FOR ACCELERATING GENERIC PREDICTION PIPELINES
Nowadays, several “intelligent” services such as Microsoft
Cortana speech recognition, Netflix movie recommender or
Gmail spam detector depend on ML model scoring capabil-
ities, and are currently experiencing a growing demand that
in turn fosters the research on their acceleration.
While efforts exist that try to accelerate specific types of
models (e.g., Brainwave for deep neural networks) many
models we actually see in production are often generic
and composed of several (tens of) different transformations.
Many workloads run in a prediction-serving, cloud-like
setting, where prediction models coming from data science
experts are operationalized. While data scientists prefer to
use high-level declarative tools such as IMLT for better
productivity, operationalized models require low latency,
high throughput, and highly predictable performance. By
separating the platform where models are authored (and
trained) from the one where models are operationalized, we
can make simplifying assumptions on the latter:
• models are pre-trained and do not change during
execution;
• models from the same user likely share several
transformations, or, in case such transformations are
stateful (and immutable), they likely share part of
the state; in some cases (for example a tokenization
process on a common English vocabulary) the state
sharing may hold even beyond single users.
In the breakdown of Figure 2, the prediction (the LinReg transformation) takes a very small amount of time compared to the other components (0.3% of the execution time). Instead, the n-gram transformations take up almost 60% of the execution time, and another 33% is spent in concatenating the feature vectors, which consists in allocating memory and copying data. In summary, most of the execution time is lost in preparing the data for scoring, not in computing the prediction. Across our workloads, we observed that this situation is common to many models, because many tasks employ simple prediction models like linear regression functions, both because of their prediction latency and because of their simplicity and understandability. In these scenarios, the acceleration of an ML pipeline cannot focus on specific transformations, as in the case of neural networks where the bulk of computation is spent in scoring, but requires a more systematic approach.
3 CONSIDERATIONS AND SYSTEM IMPLEMENTATION
Following the observations of Section 2, we argue that the acceleration of generic ML prediction pipelines is possible if we optimize models end-to-end instead of single transformations, i.e., while data scientists focus on high-level “logical” transformations, the operationalization of the model for scoring should focus on execution units taking optimal decisions on how data is processed through the pipelines. We call such execution units stages, borrowing the term from the database and system communities. A stage is a sequence of one or more transformations to be executed together on a device (either a CPU or an FPGA in this work), and represents the “atomic” unit of computation. No data movement from one computing device to another occurs within a stage, while communications can occur between two stages running on different devices, such as the PCIe communication when offloading to FPGAs.
[2] R. He and J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of
the 25th International Conference on World Wide Web, WWW ’16
[3] A. Scolari, Y. Lee, M. Weimer and M. Interlandi, "Towards Accelerating Generic Machine Learning Prediction Pipelines," 2017 IEEE International
Conference on Computer Design (ICCD)
• Overheads are a significant fraction of runtime
- JIT, GC, virtual calls, …
• Operators have different buffers and break
locality
• Operators are executed one at a time: no code
fusion, multiple data accesses
7
Limitations of black-box
(a) We identify the operators by their parameters. The first two groups represent N-gram
operators, which have multiple versions with different length (e.g., unigram, trigram)
(b) 3 different ML models are used,
each with distinct parameters.
Figure 3: Probability for an operator within the 250 different pipelines. If an operator x has the probability y%, then 250 × y/100 pipelines have operator x, implying that pipelines can save memory by sharing one instance of operator x.
Figure 4: CDF of latency of prediction requests of 250
DAGs. We denote the first prediction as cold; the hot
line is reported as average over 100 predictions after a
warm-up period of 10. The plot is normalized over the
99th percentile latency of the hot case.
Figure 4 describes this situation, where the performance of hot
predictions over the 250 sentiment analysis pipelines with
memory already allocated and JIT-compiled code is more
than an order-of-magnitude faster than the cold version
for the same pipelines. To drill down more into the prob-
lem, we found that 57.4% of the total execution time for a
single cold prediction is spent in pipeline analysis and ini-
tialization of the function chain, 36.5% in JIT compilation
and the remaining is actual computation time.
Infrequent Accesses: In order to meet milliseconds-level
latencies [51], model pipelines have to reside in main
memory (possibly already warmed-up), since they can
have MBs to GBs (compressed) size on disk, with loading
and initialization times easily exceeding several seconds.
A common practice in production settings is to unload
a pipeline if not accessed after a certain period of time
(e.g., a few hours). Once evicted, successive accesses will
incur a model loading penalty and warm-up cost, therefore
violating Service Level Agreements (SLAs).
Operator-at-a-time Model: As previously described,
predictions over ML2 pipelines are computed by pulling
records through a sequence of operators, each of them
transforming the input vector(s) and producing one
or more new vectors. While (as is common practice for
in-memory data-intensive systems [38, 48, 18]) some in-
terpretation overheads are eliminated via JIT compilation,
operators in ML2 (and in other tools) are “logical” entities
(e.g., linear regression, tokenizer, one-hot encoder, etc.)
with diverse performance characteristics. Figure 5 shows
the latency breakdown of one execution of the sentiment
analysis pipeline of Figure 1, where the only ML opera-
tor (linear regression) takes two orders-of-magnitude less
time with respect to the slowest operator (WordNgram).
It is common practice for in-memory data-intensive sys-
tems to pipeline operators in order to minimize memory
accesses for memory-intensive workloads, and to vec-
torize compute intensive operators in order to minimize
the number of instructions per data item [26, 56]. ML2's
operator-at-a-time model [56] (like other libraries missing
an optimization layer, such as Scikit-learn) is therefore
sub-optimal in that computation is organized around log-
ical operators, ignoring how those operators behave to-
gether: in the example of the sentiment analysis pipeline
at hand, linear regression is commutative and associative
(e.g., dot product between vectors) and can be pipelined
with Char and WordNgram, eliminating the need for the
Concat operation. As we will see in the following Sections,
PRETZEL's optimizer is able to detect this situation and to
generate an execution plan that is several times faster than
the unoptimized ML2 version of the pipeline.
Coarse Grained Scheduling: Scheduling CPU re-
sources carefully is essential to serve highly concurrent
requests and run machines to maximum utilization. Under
the black box approach: (1) a thread pool is used to serve
multiple concurrent requests to the same model pipeline;
(2) for each request, one thread handles the execution of
• No state sharing: state is duplicated → memory is wasted
Limited performance
Limited density
• Optimisations for single operators, like DNNs [4-5]
• TensorFlow Serving [6] as Servable Python objects
• ML.NET as zip files with state files and DLLs
• Clipper [7] and Rafiki [8] deploy pipelines as Docker containers
– They schedule requests based on latency targets
– Can apply caching and batching
• MauveDB [9] accepts regression and interpolation models and
optimises them as DB views
8
Related work
[4] https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/
[5] In-Datacenter Performance Analysis of a Tensor Processing Unit, arXiv, Apr. 2017
[6] C. Olston et al. TensorFlow-Serving: Flexible, high-performance ML serving. In Workshop on ML Systems at NIPS, 2017
[7] D. Crankshaw et al. Clipper: A low-latency online prediction serving system. In NSDI, 2017
[8] W. Wang, S. Wang, J. Gao, et al. Rafiki: Machine Learning as an Analytics Service System. arXiv e-prints, Apr. 2018
[9] A. Deshpande and S. Madden. MauveDB: Supporting model-based user views in database systems. In SIGMOD, 2006
PRETZEL, WHITE-BOX SERVING SYSTEM
9
White Box Prediction Serving: make pipelines co-exist
better, schedule better
1. End-to-end Optimizations: merge operators into
computational units, to decrease overheads
2. Multi-model Optimizations: create once, use
everywhere, for both data and stages
10
Design principles
SELECT * FROM Model2 WHERE …
JOIN WORD-NGRAM
11
Afterthought: scoring as querying
(Diagram: the query reuses the shared WORD-NGRAM operator and its dictionary {good: 0, bad: 1, …} to score “This is a good product”)
12
Off-line phase - Flour+Oven
A Scheduler manages the dynamic decisions on how to schedule plans
based on machine workload. Finally, a FrontEnd is used
to submit prediction requests to the system.
Model pipeline deployment and serving in PRETZEL
follow a two-phase process. During the off-line phase
(Section 4.1), ML2’s pre-trained pipelines are translated
into Flour transformation-based API. Oven optimizer re-
arranges and fuses transformations into model plans com-
posed of parameterized logical units called stages. Each
logical stage is then AOT compiled into physical computa-
tion units where memory resources and threads are pooled
at runtime. Model plans are registered for prediction serv-
ing in the Runtime where physical stages and parameters
are shared between pipelines with similar model plans. In
the on-line phase (Section 4.2), when an inference request
for a registered model plan is received, physical stages are
parameterized dynamically with the proper values main-
tained in the Object Store. The Scheduler is in charge of
binding physical stages to shared execution units.
Figures 6 and 7 pictorially summarize the above de-
scriptions; note that only the on-line phase is executed
at inference time, whereas the model plans are generated
completely off-line. Next, we will describe each layer
composing the PRETZEL prediction system.
4.1 Off-line Phase
4.1.1 Flour
The goal of Flour is to provide an intermediate represen-
tation between ML frameworks (currently only ML2) and
PRETZEL that is both easy to target and amenable to op-
timizations. Once a pipeline is ported into Flour, it can
be optimized and compiled (Section 4.1.2) into a model
plan before getting fed into PRETZEL Runtime for on-
line scoring. Flour is a language-integrated API similar
to KeystoneML [45], RDDs [53] or LINQ [35] where
sequences of transformations are chained into DAGs and
lazily compiled for execution.
Listing 1 shows how the sentiment analysis pipeline of
Figure 1 can be expressed in Flour. Flour programs are
composed of transformations where a one-to-many map-
ping exists between ML2 operators and Flour transforma-
tions (i.e., one operator in ML2 can be mapped to many
transformations in Flour). Each Flour program starts from
a FlourContext object wrapping the Object Store.
Subsequent method calls define a DAG of transforma-
tions, which will end with a call to plan to instantiate
the model plan before feeding it into PRETZEL Runtime.
For example, in line 3 of Figure 1 the CSV.FromText
call is used to specify that the target DAG accepts as in-
put text in csv format where fields and fieldsType
arrays indicate the number and type of input fields. The
successive call to Tokenize in line 4 splits the input
fields into tokens. Lines 6 and 7 contain the two branches
defining the char-level and word-level n-gram transforma-
tions, which are then merged with the Concat transform
in line 9 before the linear binary classifier of line 10. Both
char and word n-gram transformations are parametrized
by the number of n-grams and maps translating n-grams
into numerical format (not shown in the Figure). Addi-
tionally, each Flour transformation accepts as input an
optional set of statistics gathered from training. These
statistics are used by the compiler to generate physical
plans more efficiently tailored to the model characteristics.
Example statistics are max vector size (to define the mini-
mum size of vectors to fetch from the pool at prediction
time, see Section 4.2), dense or sparse representations, etc.
Listing 1: Flour program for the sentiment analysis
pipeline. Transformations’ parameters are extracted from
the original ML2 pipeline.
1  var fContext = new FlourContext(objectStore, ...)
2  var tTokenizer = fContext.CSV.
3                   FromText(fields, fieldsType, sep).
4                   Tokenize();
5
6  var tCNgram = tTokenizer.CharNgram(numCNgrms, ...);
7  var tWNgram = tTokenizer.WordNgram(numWNgrms, ...);
8  var fPrgrm = tCNgram.
9               Concat(tWNgram).
10              ClassifierBinaryLinear(cParams);
11
12 return fPrgrm.Plan();
We have instrumented the ML2 library to collect statistics
from training, together with the related bindings to the
Object Store and Flour, to automatically extract Flour
programs from pipelines once trained.
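As a rough illustration of how such training-time statistics could be exploited (a minimal sketch with hypothetical names, not the actual Flour or PRETZEL API), a max-vector-size statistic can be used to size a pre-allocated vector pool, so that no memory is allocated on the prediction path:

// Hypothetical sketch: a statistic gathered at training time sizes a
// pre-allocated vector pool used at prediction time (assumed names).
using System;
using System.Collections.Concurrent;

var stats = new Stats(MaxVectorSize: 200, Sparse: false);
var pool = new VectorPool(stats);

var features = pool.Rent();      // reused buffer, sized from training statistics
// ... featurize into `features` and score ...
pool.Return(features);

record Stats(int MaxVectorSize, bool Sparse);

class VectorPool
{
    private readonly ConcurrentBag<float[]> _pool = new();
    private readonly int _size;
    public VectorPool(Stats stats) => _size = stats.MaxVectorSize;
    public float[] Rent() => _pool.TryTake(out var v) ? v : new float[_size];
    public void Return(float[] v) { Array.Clear(v, 0, v.Length); _pool.Add(v); }
}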
4.1.2 Oven
With Oven, our goal is to bring query compilation and
optimization techniques into ML2.
Optimizer: When plan is called on a Flour transforma-
tion’s reference (e.g., fPrgrm in line 12 of Listing 1),
all transformations leading to it are wrapped and analyzed.
Oven follows the typical rule-based database optimizer de-
sign where operator graphs (query plans) are transformed
by a set of rules until a fixed point is reached (i.e., the
graph does not change after the application of any rule).
The goal of Oven Optimizer is to transform an input graph
of Flour transformations into a stage graph, where each
stage contains one or more transformations. To group
transformations into stages we used Tupleware’s hy-
brid approach [26]: memory-intensive transformations
(Figure 6 diagram: (1) Flour transforms → (2) Optimization into logical stages S1–S3 with params and stats → (3) Compilation into physical stages; together these form the Model Plan)
Figure 6: Model optimization and compilation in PRET-
ZEL. In (1), a model is translated into a Flour program. (2)
Oven Optimizer generates a DAG of logical stages from
the program. Additionally, parameters and statistics are
extracted. (3) A DAG of physical stages is generated by
the Oven Compiler using logical stages, parameters, and
statistics. A model plan is the union of all the elements
and is fed to the runtime.
(such as most featurizers) are pipelined together in a sin-
gle pass over the data. This strategy achieves best data
locality because records are likely to reside in CPU reg-
isters [33, 38]. Compute-intensive transformations (e.g.,
vectors or matrices multiplications) are executed one-at-
a-time so that Single Instruction, Multiple Data (SIMD)
vectorization can be exploited, therefore optimizing the
number of instructions per record [56, 20]. Stages are gen-
erated by traversing the Flour transformation graph until a
pipeline-breaking transformation is found: all transforma-
tions up to that point are then grouped into a single stage.
Within each stage, compute-bound operations are vec-
torized. This strategy allows PRETZEL to considerably
decrease the number of vectors (and as a consequence
memory) with respect to the operator-at-a-time strategy
of ML2. Dependencies between stages are created as ag-
gregation of the dependencies between the internal trans-
formations. Transformation classes are annotated (e.g.,
1-to-1, 1-to-n, memory-bound, compute-bound, commu-
tative and associative) to ease the optimization process:
no dynamic compilation [26] is necessary since the set
of operators is fixed and manual annotation is sufficient
to generate properly optimized plans.4 In the example
sentiment analysis pipeline of Figure 1, Oven is able to
recognize that the Linear Regression can be pushed into
Char and WordNgram, therefore bypassing the execution
of Concat. Additionally, Tokenizer can be reused between
Char and WordNgram, therefore it will be pipelined with
CharNgram (in one stage) and a dependency between
CharNgram and WordNgram (in another stage) will be
created. The final plan will therefore be composed of 2
stages, versus the initial 4 operators (and vectors) of ML2.
4 Note that ML2 does provide a second order operator accepting
arbitrary code requiring dynamic compilation. However, this is not
supported in our current version of PRETZEL.
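As a very rough sketch of the stage-grouping rule described above (hypothetical types and a deliberately simplified linear chain, not PRETZEL's optimizer; in particular the push-down of the commutative and associative Linear Regression and the removal of Concat are not modeled), consecutive memory-bound featurizers are fused into one stage and a new stage is cut at every pipeline-breaking or compute-bound transformation:

// Hypothetical sketch of stage grouping over a simplified linear chain.
using System;
using System.Collections.Generic;
using System.Linq;

var chain = new[]
{
    new Transform("Tokenizer", Kind.MemoryBound, PipelineBreaking: false),
    new Transform("CharNgram", Kind.MemoryBound, PipelineBreaking: false),
    new Transform("WordNgram", Kind.MemoryBound, PipelineBreaking: true),   // re-reads the token stream
};

var stages = new List<List<Transform>>();
var current = new List<Transform>();
foreach (var t in chain)
{
    // Close the current stage before a pipeline breaker or a compute-bound transform,
    // so memory-bound transforms are pipelined and compute-bound ones run one-at-a-time.
    if (current.Count > 0 && (t.PipelineBreaking || t.Kind == Kind.ComputeBound))
    {
        stages.Add(current);
        current = new List<Transform>();
    }
    current.Add(t);
}
if (current.Count > 0) stages.Add(current);

for (int i = 0; i < stages.Count; i++)
    Console.WriteLine($"Stage {i + 1}: {string.Join(" -> ", stages[i].Select(t => t.Name))}");
// Stage 1: Tokenizer -> CharNgram
// Stage 2: WordNgram

enum Kind { MemoryBound, ComputeBound }
record Transform(string Name, Kind Kind, bool PipelineBreaking);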
Model Plan Compiler: Model plans are composed of
two DAGs: a DAG composed of logical stages, and a
DAG of physical stages. Logical stages are an abstraction
of the stages output by the Oven Optimizer, with related
parameters; physical stages contain the actual code
that will be executed by the PRETZEL runtime. For each
given DAG, there is a 1-to-n mapping between logical and
physical stages so that a logical stage can represent the
execution code of different physical implementations. A
physical implementation is selected based on the parame-
ters characterizing a logical stage and available statistics.
Plan compilation is a two-step process. After the stage
DAG is generated by the Oven Optimizer, the Model
Plan Compiler (MPC) maps each stage into its logical
representation containing all the parameters for the trans-
formations composing the original stage generated by the
optimizer. Parameters are saved for reuse in the Object
Store (Section 4.1.3). Once the logical plan is generated,
MPC traverses the DAG in topological order and maps
each logical stage into a physical implementation. Phys-
ical implementations are AOT-compiled, parameterized,
lock-free computation units. Each physical stage can be
seen as a parametric function which will be dynamically
fed at runtime with the proper data vectors and pipeline-
specific parameters. This design allows PRETZEL runtime
to share the same physical implementation between mul-
tiple pipelines and no memory allocation occurs on the
prediction-path (more details in Section 4.2.1). Logical
plans maintain the mapping between the pipeline-specific
parameters saved in the Object Store and the physical
stages executing on the runtime. Figure 6 summarizes the
process of generating model plans out of ML2 pipelines.
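To picture the relation between a shared physical stage and pipeline-specific parameters, here is a minimal hedged sketch (hypothetical names and key scheme, not the actual PRETZEL runtime): a single compiled n-gram stage is reused by two model plans, each binding its own dictionary fetched by key at prediction time:

// Hypothetical sketch: one AOT-compiled physical stage, many logical stages
// that only carry the key of their pipeline-specific parameters.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

var objectStore = new ConcurrentDictionary<string, object>();
objectStore["modelA/ngram-dict"] = new Dictionary<string, int> { ["good"] = 0, ["bad"] = 1 };
objectStore["modelB/ngram-dict"] = new Dictionary<string, int> { ["nice"] = 0, ["poor"] = 1 };

// The shared physical implementation: a parametric function with no per-model state inside.
float[] NgramStage(string text, IReadOnlyDictionary<string, int> dict)
{
    var vec = new float[dict.Count];
    foreach (var token in text.Split(' '))
        if (dict.TryGetValue(token, out var idx)) vec[idx] += 1f;
    return vec;
}

// A logical stage records which parameters to bind at prediction time.
float[] Run(string modelKey, string input) =>
    NgramStage(input, (IReadOnlyDictionary<string, int>)objectStore[$"{modelKey}/ngram-dict"]);

Console.WriteLine(string.Join(",", Run("modelA", "this is a good product")));   // 1,0
Console.WriteLine(string.Join(",", Run("modelB", "this is a good product")));   // 0,0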
4.1.3 Object Store
The motivation behind Object Store is based on the in-
sights of Figure 3: since many DAGs have similar struc-
tures, sharing operators’ state (parameters) can consid-
erably improve memory footprint, and consequently the
number of predictions served per machine. An example
is language dictionaries used for input text featurization,
which are often in common among many models and are
relatively large. The Object Store is populated off-line, when the Model Plan Compiler saves stage parameters for reuse (Section 4.1.2).
• Two main components:
– Runtime, with an Object Store
– Scheduler
• Runtime handles physical resources: threads and buffers
• Object Store caches objects of all models
– models register and retrieve state objects via a key (see the sketch below)
• Scheduler is event-based, each stage being an event
13
On-line phase
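A minimal sketch of the register-and-retrieve-by-key idea (hypothetical API, not PRETZEL's actual Object Store): if two models reference the same key, the state object is loaded once and shared, which is where the density gains discussed earlier come from:

// Hypothetical sketch of a key-based object store shared by all models.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

var store = new ConcurrentDictionary<string, object>();

// Register-or-retrieve: the loader runs only if the key is not present yet.
object GetOrRegister(string key, Func<object> load) => store.GetOrAdd(key, _ => load());

// Two models trained on the same corpus reference the same dictionary key.
var dictA = GetOrRegister("dict/amazon-en", () => new Dictionary<string, int> { ["good"] = 0, ["bad"] = 1 });
var dictB = GetOrRegister("dict/amazon-en", () => new Dictionary<string, int> { ["good"] = 0, ["bad"] = 1 });

Console.WriteLine(ReferenceEquals(dictA, dictB));   // True: one copy in memory for both models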
EVALUATION
14
• Two model classes written in ML.NET, running in ML.NET and
Pretzel
– 250 Sentiment-Analysis models
– 50 Regression Task models
• Testbed representing a small production server
– 2× 8-core Xeon E5-2620 v4 at 2.10 GHz, Hyper-Threading enabled
– 32 GB RAM
– Windows 10
15
Workload and testbed
• 7× less memory for SA models, ~40% less for RT models
• Higher model density, higher efficiency and
profitability
16
Memory
Figure 8: Cumulative memory usage of the model
pipelines with and without Object Store, normalized by
the maximum usage with Object Store for SA models.
By sharing common objects, we can decrease memory
usage by up to 7× in SA and around 40% for RT models.
Keeping track of plan parameters can also help reduce the
time for loading models: as expected PRETZEL can load
300 models 7.4 times faster compared to ML2.
5.2 Latency
In this experiment we study the latency behavior of PRET-
ZEL and compare it with ML2. We also study hot and cold
scenarios. Specifically, we label as cold the first predic-
tion requested for a model; the successive 10 predictions
are then discarded and we report hot numbers as the aver-
age of the following 100 predictions. Inference requests
are submitted sequentially and in isolation for one model
at a time. For PRETZEL we use the request-response en-
gine. The comparison between PRETZEL and ML2 is
reported in Figure 9. As we can see, in ML2 the 99th per-
centile latency in cold (denoted by P99cold) is more than
an order-of-magnitude slower than P99hot, with a peak of
almost a 300× slowdown in the worst case. Conversely,
in PRETZEL P99cold is within 3× from the P99hot, and 20× in the worst case (an improvement of more than 10× on tail latency). If instead we compare PRETZEL directly with ML2, PRETZEL is 2.6× faster than ML2 in the P99hot case, and about 15× in the P99cold case. Peak performance gains for PRETZEL for SA pipelines are above 50×, and 100× in the RT category.
Figure 9: Latency comparison between ML2 and PRET-
ZEL, with values normalized over ML2’s P99hot latency.
Cliffs indicate pipelines with big differences in latency.
Figure 10: Latency of PRETZEL to run SA models with
and without sub-plan materialization, normalized by the
P99hot latency of the value obtained without the optimiza-
tion. Around 85% of SA pipelines show 2.5× speedup.
Sub-plan materialization does not apply for RT pipelines.
With result caching enabled, latency decreases even further when
the same inferences are requested for the same model. In this
case the latency is proportional to a dictionary look-up.
5.3 Throughput
In this experiment, we assume a batch scenario where all
300 models are scored several times using a fixed batch
• Hot vs Cold scenarios
• Without and with Caching of partial results
17
Latency
Figure 8: Cumulative memory usage of the model
pipelines with and without Object Store, normalized by
the maximum usage with Object Store for SA models.
By sharing common objects, we can decrease memory
usage by up to 7× in SA and around 40% for RT models.
Keeping track of plan parameters can also help reduce the
time for loading models: as expected PRETZEL can load
300 models 7.4 times faster compared to ML2.
5.2 Latency
In this experiment we study the latency behavior of PRET-
ZEL and compare it with ML2. We also study hot and cold
scenarios. Specifically, we label as cold the first predic-
tion requested for a model; the successive 10 predictions
are then discarded and we report hot numbers as the aver-
age of the following 100 predictions. Inference requests
are submitted sequentially and in isolation for one model
at a time. For PRETZEL we use the request-response en-
gine. The comparison between PRETZEL and ML2 is
reported in Figure 9. As we can see, in ML2 the 99th percentile latency in cold (denoted by P99cold) is more than an order of magnitude slower than P99hot, with a peak of almost a 300× slowdown in the worst case. Conversely, in PRETZEL P99cold is within 3× of P99hot, and 20× in the worst case (an improvement of more than 10× on tail latency). If we instead compare PRETZEL directly with ML2, PRETZEL is 2.6× faster than ML2 in the P99hot case, and about 15× in the P99cold case. Peak performance gains for PRETZEL are above 50× for SA pipelines and 100× in the RT category.
Sub-plan Materialization: Figure 10 depicts the effect of sub-plan materialization on prediction latency. As we can see, only part of the pipelines benefit from materializing partial results, while no pipeline shows performance deterioration (except the first cold scoring of the first model, which shows a 6× slowdown). In general,
Figure 9: Latency comparison between ML2 and PRETZEL, with values normalized over ML2’s P99hot latency. Cliffs indicate pipelines with big differences in latency.
Figure 10: Latency of PRETZEL to run SA models with and without sub-plan materialization, normalized by the P99hot latency of the value obtained without the optimization. Around 85% of SA pipelines show a 2.5× speedup. Sub-plan materialization does not apply to RT pipelines.
With result caching enabled, latency decreases even further when the same inferences are requested for the same model. In this case the latency is proportional to a dictionary look-up.
5.3 Throughput
In this experiment, we assume a batch scenario where all
300 models are scored several times using a fixed batch
size of 1000. We used from 2 up to 29 cores to serve
requests, while 3 cores are reserved to generate them. The
main goal is to measure the maximum number of requests
PRETZEL and ML2 can serve per second. Materialization
and caching are not enabled for this experiment.
Figure 11 shows that PRETZEL’s throughput is up to 2.5× higher than ML2 (at the 16-core scale) and scales on par with the expected maximum scaling, whereas ML2 suffers from a decrease in scalability when the number of
CPU cores increases. This is because each thread has
its own internal copy of models whereby cache lines are
W/o Caching W/o and w/ Caching
• ML.NET cold: 10-300x w.r.t. hot
• Pretzel cold: 3-20x slower than hot
• Pretzel vs ML.NET: cold is 15x faster,
hot is 2.6x
• For single-model runs, average
improvement is 2.5x
• More gain with multiple requests
• Batch size is 1000 inputs, multiple runs of same batch
• Delay batching: as in Clipper, batch requests and accept
higher latency
18
Throughput
Figure 11: The average throughput computed among the
300 models to process one million inputs each. We scale
the number of cores on the x-axis. PRETZEL scales lin-
early to the number of CPU cores, close to the expected
maximum throughput with Hyper Threading enabled.
not shared, thus increasing the pressure on the memory
subsystem: indeed, even if the data values are the same,
the model objects are mapped to different memory areas.
Delay Batching: As in Clipper, PRETZEL FrontEnd al-
lows users to specify a maximum delay they can wait to
maximize throughput. For each model category, Figure 12
depicts the trade-off between throughput and latency as
the delay increases. Interestingly, for such models even a
small delay helps reaching the optimal batch size.
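The delay-batching idea can be sketched as a loop that drains requests until either the batch is full or the user-specified maximum delay expires. The C# sketch below is a simplified assumption of how such a front end could be structured, not PRETZEL's actual FrontEnd code.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;

// Simplified delay-batching loop (illustrative, not PRETZEL's FrontEnd implementation).
// Requests are drained until either the target batch size is reached or the
// maximum delay expires, trading a bounded latency increase for higher throughput.
public static class DelayBatcher
{
    public static List<string> NextBatch(
        BlockingCollection<string> queue, int maxBatchSize, TimeSpan maxDelay)
    {
        var batch = new List<string>();
        var timer = Stopwatch.StartNew();

        while (batch.Count < maxBatchSize)
        {
            var remaining = maxDelay - timer.Elapsed;
            if (remaining <= TimeSpan.Zero) break;                 // delay budget exhausted
            if (queue.TryTake(out var request, remaining)) batch.Add(request);
            else break;                                            // queue idle for the rest of the budget
        }
        return batch;                                              // scored as one vectorized call
    }

    public static void Main()
    {
        var queue = new BlockingCollection<string>();
        queue.Add("This is a good product");
        queue.Add("This is a bad product");

        var batch = NextBatch(queue, maxBatchSize: 1000, maxDelay: TimeSpan.FromMilliseconds(2));
        Console.WriteLine($"batched {batch.Count} requests");
    }
}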
Figure 12: Throughput and latency of SA and RT models with delayed batching: while throughput increases with the delay, record latency also increases. RT latency is more tolerant to larger delays compared to SA.
5.4 Heavy Load
In this last experiment we load all 300 models in one instance and we submit requests to all models concurrently, following a Zipf distribution. As for the throughput experiment, we use 3 threads to generate load and 29 to serve requests. Figure 13 reports how throughput and latency vary as we increase the load. As expected, PRETZEL’s throughput increases initially and stabilizes. PRETZEL’s latency, instead, gracefully increases linearly with the increase of the load.
Figure 13: Throughput and latency of PRETZEL under
heavy load scenario.
Reservation Scheduling: If we run the previous exper-
iment using reservation scheduling for one model, such
model does not encounter any degradation in latency (max
improvement of 3 orders-of-magnitude) as the load in-
creases, while the total throughput increases about 3%.
In the experiment, 28 cores are used for the shared load,
while one core (and related vectors) is reserved exclu-
sively for one model.
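A minimal sketch of the reservation idea: requests for the reserved model are routed to a dedicated worker and queue, isolating its latency from the shared load. The names and structure below are assumptions for illustration only, not PRETZEL's scheduler code.

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Illustrative reservation scheduling: requests for the reserved model go to a
// dedicated queue and worker so their latency is isolated from the shared load.
public sealed class ReservationScheduler
{
    private readonly string _reservedModel;
    private readonly BlockingCollection<(string Model, string Input)> _reservedQueue =
        new BlockingCollection<(string Model, string Input)>();
    private readonly BlockingCollection<(string Model, string Input)> _sharedQueue =
        new BlockingCollection<(string Model, string Input)>();

    public ReservationScheduler(string reservedModel) => _reservedModel = reservedModel;

    public void Submit(string model, string input)
    {
        var queue = model == _reservedModel ? _reservedQueue : _sharedQueue;
        queue.Add((model, input));
    }

    // One worker drains the reserved queue; 'sharedWorkers' workers drain the shared one.
    public void Start(int sharedWorkers, Action<string, string> score)
    {
        Task.Run(() => Drain(_reservedQueue, score));
        for (int i = 0; i < sharedWorkers; i++)
            Task.Run(() => Drain(_sharedQueue, score));
    }

    private static void Drain(
        BlockingCollection<(string Model, string Input)> queue, Action<string, string> score)
    {
        foreach (var (model, input) in queue.GetConsumingEnumerable())
            score(model, input);
    }

    public static void Main()
    {
        var sched = new ReservationScheduler("model-42");
        sched.Start(3, (m, x) => Console.WriteLine($"{m}: scored '{x}'"));
        sched.Submit("model-42", "reserved request");
        sched.Submit("model-7", "shared request");
        Thread.Sleep(100); // let the workers drain the queues
    }
}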
6 Limitations and Future Work
Off-line Phase: PRETZEL suffers two limitations regard-
ing Flour and Oven design. First, PRETZEL currently has
several logical and physical stages classes, one per possi-
ble implementation, which make the system difficult to
maintain in the long run. Additionally, different back-ends
(e.g., PRETZEL currently supports operators implemented
in C# and C++, and experimentally on FPGA [X]) require
all specific operator implementations. We are however
confident that this limitation will be overcome once code generation of stages is added (e.g., with hardware-specific templates [34]).
W/o Delay batching
• ML.NET suffers from missed data
sharing, i.e. higher memory traffic
W/ Delay batching
• Not supported in ML.NET
• Quickly reaches optimal batch size
CONCLUSIONS AND FUTURE WORK
19
• We addressed performance/density bottlenecks in ML
inference for MaaS
• We advocate the adoption of a white-box approach
• We are re-thinking ML inference as a DB query problem
- buffer/locality issues
- code generation
- runtime, hardware-aware scheduling
20
Conclusions
• Oven has pre-written implementations for logical and
physical stages: not maintainable in the long run
• We are adding code-generation of stages, maybe with
hardware-specific templates [10]
• Tune code-generation for more-aggressive optimizations
• Support more model formats than ML.NET, like ONNX [11]
21
Off-line phase
[10] K. Krikellas, S. Viglas, et al. Generating code for holistic query evaluation. In ICDE, pages 613–624. IEEE Computer Society, 2010
[11] Open Neural Network Exchange (ONNX). https://onnx.ai, 2017
• Distributed version
• Main work is on NUMA awareness, to improve locality and
scheduling
– Schedule on the NUMA node where most shared state resides (a possible heuristic is sketched below)
– Duplicate hot, shared state objects on nodes for lower
latency
– Scale better, avoiding bottlenecks on coherence bus
22
On-line phase
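The first bullet above can be illustrated with a small placement heuristic: pick the NUMA node holding the largest share of the model's shared state. This is only a sketch of the planned direction; the bookkeeping maps below (objectNode, objectBytes) are hypothetical, not existing PRETZEL structures.

using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative NUMA-placement heuristic: schedule a model's request on the node
// that already holds most of the bytes of its shared state objects.
public static class NumaPlacement
{
    public static int PickNode(
        IEnumerable<string> modelObjectKeys,
        IReadOnlyDictionary<string, int> objectNode,    // key -> NUMA node where it resides
        IReadOnlyDictionary<string, long> objectBytes)  // key -> size of the shared object
    {
        return modelObjectKeys
            .GroupBy(key => objectNode[key])
            .OrderByDescending(g => g.Sum(key => objectBytes[key]))
            .First().Key;                               // node with most resident shared bytes
    }

    public static void Main()
    {
        var node = PickNode(
            new[] { "en-vocab", "cngram-dict", "weights" },
            new Dictionary<string, int> { ["en-vocab"] = 0, ["cngram-dict"] = 0, ["weights"] = 1 },
            new Dictionary<string, long> { ["en-vocab"] = 50_000_000, ["cngram-dict"] = 20_000_000, ["weights"] = 1_000_000 });
        Console.WriteLine($"schedule on NUMA node {node}"); // node 0 holds most shared bytes
    }
}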
Pretzel is currently under submission at OSDI ‘18
QUESTIONS ?
Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it
BACKUP
23
• Oven’s Model Plan Compiler compiles logical stages into physical stages, based on parameters and statistics (a possible selection rule is sketched below)
24
Off-line phase - Oven
[9] A. Crotty, A. Galakatos, K. Dursun, et al. An architecture for compiling UDF-centric workflows. PVLDB, 8(12):1466–1477, Aug. 2015
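As a toy example of such parameter- and statistics-driven selection, the C# sketch below chooses between a dense and a sparse implementation of a stage; the enum, thresholds, and statistics are assumptions for illustration, not Oven's actual cost model.

using System;

// Illustrative selection of a physical implementation for a logical stage based on
// its parameters and statistics (names and thresholds are assumptions, not Oven's code).
public enum PhysicalImpl { DenseVectorized, SparseHashed }

public static class StagePlanner
{
    // E.g., prefer a sparse implementation when the expected feature vector is large
    // and mostly empty; otherwise use a dense, vectorized one.
    public static PhysicalImpl Choose(int featureVectorLength, double observedDensity)
    {
        bool largeAndSparse = featureVectorLength > 100_000 && observedDensity < 0.01;
        return largeAndSparse ? PhysicalImpl.SparseHashed : PhysicalImpl.DenseVectorized;
    }

    public static void Main()
    {
        // Statistics gathered at plan-compilation time (hypothetical values).
        Console.WriteLine(Choose(featureVectorLength: 1_000_000, observedDensity: 0.001)); // SparseHashed
        Console.WriteLine(Choose(featureVectorLength: 200, observedDensity: 0.8));         // DenseVectorized
    }
}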
[Figure 6 diagram: (1) a Flour program (var fContext = ...; var Tokenizer = ...; return fPrgm.Plan();) is (2) optimized by Oven into a DAG of logical stages S1–S3, with parameters and statistics extracted; (3) the Oven Compiler maps the logical stages into physical stages; logical stages, physical stages, parameters, and statistics together form the model plan.]
Figure 6: Model optimization and compilation in PRET-
ZEL. In (1), a model is translated into a Flour program. (2)
Oven Optimizer generates a DAG of logical stages from
the program. Additionally, parameters and statistics are
extracted. (3) A DAG of physical stages is generated by
the Oven Compiler using logical stages, parameters, and
statistics. A model plan is the union of all the elements
and is fed to the runtime.
(such as most featurizers) are pipelined together in a sin-
gle pass over the data. This strategy achieves best data
locality because records are likely to reside in CPU reg-
isters [33, 38]. Compute-intensive transformations (e.g.,
vectors or matrices multiplications) are executed one-at-
a-time so that Single Instruction, Multiple Data (SIMD)
vectorization can be exploited, therefore optimizing the
number of instructions per record [56, 20]. Stages are gen-
erated by traversing the Flour transformation graph until a
pipeline-breaking transformation is found: all transforma-
tions up to that point are then grouped into a single stage.
Within each stage, compute-bound operations are vec-
torized. This strategy allows PRETZEL to considerably
decrease the number of vectors (and as a consequence
memory) with respect to the operator-at-a-time strategy
of ML2. Dependencies between stages are created as ag-
gregation of the dependencies between the internal trans-
formations. Transformation classes are annotated (e.g.,
1-to-1, 1-to-n, memory-bound, compute-bound, commu-
tative and associative) to ease the optimization process:
no dynamic compilation [26] is necessary since the set
of operators is fixed and manual annotation is sufficient
to generate properly optimized plans 4. In the example
sentiment analysis pipeline of Figure 1, Oven is able to
4Note that ML2 does provide a second order operator accepting
arbitrary code requiring dynamic compilation. However, this is not
supported in our current version of PRETZEL.
recognize that the Linear Regression can be pushed into
Char and WordNgram, therefore bypassing the execution
of Concat. Additionally, Tokenizer can be reused between
Char and WordNgram, therefore it will be pipelined with
CharNgram (in one stage) and a dependency between
CharNgram and WordNGram (in another stage) will be
created. The final plan will therefore be composed of 2
stages, versus the initial 4 operators (and vectors) of ML2.
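The stage-formation rule described above can be sketched as a single traversal that cuts a new stage whenever a pipeline-breaking operator is met; the Operator type and its flag below are illustrative stand-ins for PRETZEL's annotated transformation classes, not its actual types.

using System;
using System.Collections.Generic;

// Sketch of stage formation: walk the operator chain in topological order and close the
// current stage after a pipeline-breaking (e.g., compute-bound, 1-to-n) operator.
public sealed record Operator(string Name, bool PipelineBreaker);

public static class StageBuilder
{
    public static List<List<Operator>> BuildStages(IEnumerable<Operator> topologicalOrder)
    {
        var stages = new List<List<Operator>>();
        var current = new List<Operator>();

        foreach (var op in topologicalOrder)
        {
            current.Add(op);
            if (op.PipelineBreaker)          // close the stage after a breaker
            {
                stages.Add(current);
                current = new List<Operator>();
            }
        }
        if (current.Count > 0) stages.Add(current);
        return stages;
    }

    public static void Main()
    {
        // Simplified version of the sentiment-analysis chain from Figure 1.
        var ops = new[]
        {
            new Operator("Tokenizer", PipelineBreaker: false),
            new Operator("CharNgram", PipelineBreaker: true),    // breaker: stage boundary
            new Operator("WordNgram + LinearRegression", PipelineBreaker: false),
        };
        Console.WriteLine($"{BuildStages(ops).Count} stages");   // 2 stages, as in the text
    }
}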
Model Plan Compiler: Model plans are composed by
two DAGs: a DAG composed of logical stages, and a
DAG of physical stages. Logical stages are an abstrac-
tion of the stages output of the Oven Optimizer, with re-
lated parameters; physical stages contain the actual code that will be executed by the PRETZEL runtime. For each given DAG, there is a 1-to-n mapping from logical to physical stages so that a logical stage can represent the
execution code of different physical implementations. A
physical implementation is selected based on the parame-
ters characterizing a logical stage and available statistics.
Plan compilation is a two-step process. After the stage
DAG is generated by the Oven Optimizer, the Model
Plan Compiler (MPC) maps each stage into its logical
representation containing all the parameters for the trans-
formations composing the original stage generated by the
optimizer. Parameters are saved for reuse in the Object
Store (Section 4.1.3). Once the logical plan is generated,
MPC traverses the DAG in topological order and maps
each logical stage into a physical implementation. Phys-
ical implementations are AOT-compiled, parameterized,
lock-free computation units. Each physical stage can be
seen as a parametric function which will be dynamically
fed at runtime with the proper data vectors and pipeline-
specific parameters. This design allows PRETZEL runtime
to share the same physical implementation between mul-
tiple pipelines and no memory allocation occurs on the
prediction-path (more details in Section 4.2.1). Logical
plans maintain the mapping between the pipeline-specific
parameters saved in the Object Store and the physical
stages executing on the runtime. Figure 6 summarizes the process of generating model plans out of ML2 pipelines.
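The parameterized, shared physical stage described above can be pictured as a function compiled once and invoked with per-pipeline parameters (taken from the Object Store) and pre-allocated vectors. The delegate-based C# sketch below is an assumption about the shape of such a unit, not PRETZEL's runtime code.

using System;
using System.Collections.Generic;

// Illustrative parametric physical stage: one compiled body shared by all pipelines,
// fed at call time with the pipeline-specific parameters and pre-allocated
// input/output vectors, so nothing is allocated on the prediction path.
public delegate void PhysicalStage(
    IReadOnlyDictionary<string, object> pipelineParams,  // per-pipeline state, shared and immutable
    float[] input,                                        // pre-allocated input vector
    float[] output);                                      // pre-allocated output vector

public static class SharedStages
{
    // A single linear-scoring stage instance reused by every pipeline.
    public static readonly PhysicalStage LinearScore = (p, input, output) =>
    {
        var weights = (float[])p["weights"];              // pipeline-specific parameters
        float acc = 0f;
        for (int i = 0; i < input.Length; i++) acc += weights[i] * input[i];
        output[0] = acc;
    };

    public static void Main()
    {
        var paramsA = new Dictionary<string, object> { ["weights"] = new[] { 1f, 2f, 3f } };
        var paramsB = new Dictionary<string, object> { ["weights"] = new[] { -1f, 0f, 1f } };
        var input = new[] { 1f, 1f, 1f };
        var output = new float[1];

        LinearScore(paramsA, input, output); Console.WriteLine(output[0]); // 6
        LinearScore(paramsB, input, output); Console.WriteLine(output[0]); // 0
    }
}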
4.1.3 Object Store
The motivation behind Object Store is based on the in-
sights of Figure 3: since many DAGs have similar struc-
tures, sharing operators’ state (parameters) can consid-
erably improve memory footprint, and consequently the
number of predictions served per machine. An example
is language dictionaries used for input text featurization,
which are often in common among many models and are
relatively large. The Object Store is populated off-line by
• Oven optimises pipelines much like
queries
• Tupleware’s hybrid approach [9]
– memory-intensive operators are
merged into a logical stage, to
increase data locality
– Compute-intensive operators stay
alone, to leverage SIMD execution
– One-to-one operators are chained
together, while one-to-many can
break locality
• Load all 300 models upfront
• 3 threads to generate requests, 29 to score
• One core serves single requests, the others serve batches
• ML.NET does not support this scenario
• Latency degrades gracefully with load
25
Heavy load
  • 1. Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it Oracle Labs, Belmont, 05/30/2018 Yunseong Lee, Alberto Scolari, Matteo Interlandi, Markus Weimer, Marco D. Santambrogio, Byung-Gon Chun PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
  • 2. ML-as-a-Service • ML models are deployed on cloud platforms as black boxes • Users deploy multiple models per machine (10-100s) 2 • Deployed models are often similar - Similar structure - Similar state • Two key requirements: 1. Performance: latency or throughput 2. Model density: number of models per machine good:0 bad:1 … This is a good product good:0 bad:1 … This is a good product
  • 3. We need to know structure and state: white-box model 3 Breaking the black-box model 1. To generate an optimised version of a model on deployment: higher performance 2. To allocate shared state only once, and share among models: higher density
  • 4. ‣ Opening the black box ‣ Pretzel, white-box serving system ‣ Evaluation ‣ Conclusions and future work 4 Outline
  • 6. 6 Case study • 300 models in ML.NET [1], C#, from Amazon Reviews dataset [2]: 250 Sentiment Analysis + 50 Regression Task • Analyzed and accelerated in previous work [3] • Most of the time is spent in featurization • No single bottleneck PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems Paper # 355 Abstract Machine Learning models are often composed of pipelines of transformations. While this design allows to efficiently execute single model components at training- time, prediction serving has different requirements such as low latency, high throughput and graceful performance degradation under heavy load. Current prediction serv- ing systems consider models as black boxes, whereby prediction-time-specific optimizations are ignored in fa- vor of ease of deployment. In this paper, we present PRETZEL, a prediction serving system introducing a novel white box architecture enabling both end-to-end and multi-model optimizations. Using production-like model pipelines, our experiments show that PRETZEL is able to introduce performance improvements over differ- ent dimensions; compared to state-of-the-art approaches PRETZEL is able to reduce latency by up to 100⇥ while reducing memory footprint by up to 7⇥, and increasing throughput by up to 2.5⇥. 1 Introduction Many Machine Learning (ML) frameworks such as Google TensorFlow [2], Facebook Caffe2 [4], Scikit- learn [41], or CompanyX’s Internal ML Library (ML2) allow data scientists to declaratively author pipelines of transformations to train models from large-scale input datasets. Model pipelines are internally represented as Directed Acyclic Graphs (DAGs) of operators comprising data transformations and featurizers (e.g., string tokeniza- tion, hashing, etc.), and ML models (e.g., decision trees, linear models, SVMs, etc.). Figure 1 shows an example pipeline for text analysis whereby input sentences are classified according to the expressed sentiment. Machine Learning is usually conceptualized as a two- steps process: first, during training model parameters are estimated from large datasets by running computation- ally intensive iterative algorithms; successively, trained pipelines are used for inference to generate predictions through the estimated model parameters. When trained “This is a nice product” Positive vs. Negative Tokenizer Char Ngram Word Ngram Concat Linear Regression Figure 1: A sentimental analysis pipeline consisting of operators for featurization (ellipses), followed by a ML model (diamond). Tokenizer extracts tokens (e.g., words) from the input string. Char and Word Ngrams featurize input tokens by extracting n-grams. Concat generates a unique feature vector which is then scored by a Linear Regression predictor. This is a simplification: the actual DAG contains about 12 operators. pipelines are served for inference, the full set of operators is deployed altogether.However, pipelines have different system characteristics based on the phase in which they are employed: for instance, at training time ML models run complex algorithms to scale over large datasets (e.g., linear models can use gradient descent in one of its many flavors [43, 42, 44]), while, once trained, they behave as other regular featurizers and data transformations; further- more, during inference pipelines are often surfaced for direct users’ servicing and therefore require low latency, high throughput, and graceful degradation of performance in case of load spikes. 
Existing prediction serving systems, such as Clipper [7, 25], TensorFlow Serving [3, 39], Rafiki [49], ML2 itself, and others [13, 14, 36, 12] focus mainly on ease of deployment, where pipelines are considered as black boxes and deployed into containers (e.g., Docker [9] in Clipper and Rafiki, servables in TensorFlow Serving). Under this strategy, only "pipeline-agnostic" optimizations such as caching, batching and buffering are available. Nevertheless, we found that black box approaches fell short on several aspects.
Fig. 2. Execution breakdown of the example model.
2 THE CASE FOR ACCELERATING GENERIC PREDICTION PIPELINES
Nowadays, several "intelligent" services such as Microsoft Cortana speech recognition, the Netflix movie recommender or the Gmail spam detector depend on ML model scoring capabilities, and are currently experiencing a growing demand that in turn fosters research on their acceleration. While efforts exist that try to accelerate specific types of models (e.g., Brainwave for deep neural networks), many models we actually see in production are generic and composed of several (tens of) different transformations. Many workloads run in a prediction-serving, cloud-like setting, where prediction models coming from data science experts are operationalized. While data scientists prefer to use high-level declarative tools such as IMLT for better productivity, operationalized models require low latency, high throughput, and highly predictable performance. By separating the platform where models are authored (and trained) from the one where models are operationalized, we can make simplifying assumptions on the latter:
• models are pre-trained and do not change during execution;
• models from the same user likely share several transformations, or, in case such transformations are stateful (and immutable), they likely share part of the state; in some cases (for example a tokenization process on a common English vocabulary) the state sharing may hold even beyond single users.
[…] The prediction itself (the LinReg transformation) takes a very small amount of time compared to other components (0.3% of the execution time). Instead, the n-gram transformations take up almost 60% of the execution time, and another 33% is spent in concatenating the feature vectors, which consists in allocating memory and copying data. In summary, most of the execution time is lost in preparing the data for scoring, not in computing the prediction. In our workloads, we observed that this situation is common to many models, because many tasks employ simple prediction models like linear regression functions, both because of their prediction latency and because of their simplicity and understandability. In these scenarios, the acceleration of an ML pipeline cannot focus only on specific transformations, as in the case of neural networks where the bulk of computation is spent in scoring, but requires a more systematic approach.
3 CONSIDERATIONS AND SYSTEM IMPLEMENTATION
Following the observations of Section 2, we argue that acceleration of generic ML prediction pipelines is possible if we optimize models end-to-end instead of optimizing single transformations: while data scientists focus on high-level "logical" transformations, the operationalization of a model for scoring should focus on execution units making optimal decisions on how data is processed through the pipelines. We call such execution units stages, borrowing the term from the database and systems communities. A stage is a sequence of one or more transformations to be executed together on a device (either a CPU or an FPGA in this work), and represents the "atomic" unit of computation. No data movement from one computing device to another occurs within a stage, while communications may occur between two stages running on different devices (e.g., the PCIe communication when offloading to FPGA).
[2] R. He and J. McAuley. Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW '16
[3] A. Scolari, Y. Lee, M. Weimer and M. Interlandi. Towards Accelerating Generic Machine Learning Prediction Pipelines. 2017 IEEE International Conference on Computer Design (ICCD)
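To picture the kind of model used in the case study, here is a minimal sketch written against the public ML.NET 1.x API; the study's models were built with an earlier version of the library, so the exact calls differed, and the file name, column names and trainer choice below are illustrative assumptions, not the original pipeline.

using Microsoft.ML;
using Microsoft.ML.Data;

// Input/output schema for a review; names are illustrative.
public class Review
{
    [LoadColumn(0)] public string Text { get; set; }
    [LoadColumn(1)] public bool Label { get; set; }
}

public class Prediction
{
    [ColumnName("PredictedLabel")] public bool Sentiment { get; set; }
}

public static class CaseStudySketch
{
    public static void Main()
    {
        var ml = new MLContext();
        IDataView data = ml.Data.LoadFromTextFile<Review>("reviews.tsv", hasHeader: true);

        // Featurization dominates: tokenization plus char/word n-grams concatenated
        // into one feature vector, followed by a cheap linear model.
        var pipeline = ml.Transforms.Text
            .FeaturizeText("Features", nameof(Review.Text))
            .Append(ml.BinaryClassification.Trainers
                .SdcaLogisticRegression(labelColumnName: nameof(Review.Label)));

        var model = pipeline.Fit(data);

        // At serving time only the trained transform chain runs; this is the path
        // whose breakdown is shown in Fig. 2.
        var engine = ml.Model.CreatePredictionEngine<Review, Prediction>(model);
        var result = engine.Predict(new Review { Text = "This is a good product" });
    }
}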
• 7. Limitations of black-box
• Overheads are a remarkable % of runtime - JIT, GC, virtual calls, …
• Operators have different buffers and break locality
• Operators are executed one at a time: no code fusion, multiple data accesses (a minimal illustration follows below)
• No state sharing: state is duplicated -> memory is wasted
Limited performance, limited density.
Figure 3: Probability for an operator within the 250 different pipelines. If an operator x has probability y%, then 250 × y/100 pipelines have operator x, implying that pipelines can save memory by sharing one instance of operator x. (a) Operators are identified by their parameters; the first two groups represent N-gram operators, which have multiple versions with different lengths (e.g., unigram, trigram). (b) 3 different ML models are used, each with distinct parameters.
Figure 4: CDF of latency of prediction requests of 250 DAGs. We denote the first prediction as cold; the hot line is reported as the average over 100 predictions after a warm-up period of 10. The plot is normalized over the 99th percentile latency of the hot case.
Figure 4 describes this situation, where the performance of hot predictions over the 250 sentiment analysis pipelines, with memory already allocated and JIT-compiled code, is more than an order of magnitude faster than the cold version of the same pipelines. Drilling down into the problem, we found that 57.4% of the total execution time for a single cold prediction is spent in pipeline analysis and initialization of the function chain, 36.5% in JIT compilation, and the remainder is actual computation time.
Infrequent Accesses: In order to meet millisecond-level latencies [51], model pipelines have to reside in main memory (possibly already warmed up), since they can have MBs to GBs (compressed) size on disk, with loading and initialization times easily exceeding several seconds. A common practice in production settings is to unload a pipeline if not accessed after a certain period of time (e.g., a few hours). Once evicted, successive accesses will incur a model-loading penalty and warming-up, therefore violating the Service Level Agreement (SLA).
Operator-at-a-time Model: As previously described, predictions over ML2 pipelines are computed by pulling records through a sequence of operators, each of them transforming the input vector(s) and producing one or more new vectors. While (as is common practice for in-memory data-intensive systems [38, 48, 18]) some interpretation overheads are eliminated via JIT compilation, operators in ML2 (and in other tools) are "logical" entities (e.g., linear regression, tokenizer, one-hot encoder, etc.) with diverse performance characteristics. Figure 5 shows the latency breakdown of one execution of the sentiment analysis pipeline of Figure 1, where the only ML operator (linear regression) takes two orders of magnitude less time than the slowest operator (WordNgram). It is common practice for in-memory data-intensive systems to pipeline operators in order to minimize memory accesses for memory-intensive workloads, and to vectorize compute-intensive operators in order to minimize the number of instructions per data item [26, 56]. ML2's operator-at-a-time model [56] (like other libraries missing an optimization layer, such as Scikit-learn) is therefore sub-optimal in that computation is organized around logical operators, ignoring how those operators behave together: in the sentiment analysis pipeline at hand, linear regression is commutative and associative (e.g., dot product between vectors) and can be pipelined with Char and WordNgram, eliminating the need for the Concat operation. As we will see in the following sections, the PRETZEL optimizer is able to detect this situation and to generate an execution plan that is several times faster than the ML2 unoptimized version of the pipeline.
Coarse Grained Scheduling: Scheduling CPU resources carefully is essential to serve highly concurrent requests and run machines at maximum utilization. Under the black box approach: (1) a thread pool is used to serve multiple concurrent requests to the same model pipeline; (2) for each request, one thread handles the execution of a full pipeline sequentially.
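To make the "no code fusion, multiple data accesses" point concrete, here is a minimal, self-contained C# sketch; the operators are invented stand-ins, not ML.NET or Pretzel code. The first version materializes one intermediate buffer per logical operator, the second fuses the three operators into a single pass, which is essentially what Pretzel's stages do for memory-bound featurizers.

using System;

public static class FusionSketch
{
    // Operator-at-a-time: each "logical" operator writes a full intermediate vector.
    public static float ScoreUnfused(float[] input, float[] weights)
    {
        var squared = new float[input.Length];           // buffer for operator 1
        for (int i = 0; i < input.Length; i++) squared[i] = input[i] * input[i];

        var scaled = new float[input.Length];            // buffer for operator 2
        for (int i = 0; i < input.Length; i++) scaled[i] = 0.5f * squared[i];

        float score = 0;                                  // final linear model
        for (int i = 0; i < input.Length; i++) score += weights[i] * scaled[i];
        return score;
    }

    // Fused stage: one pass over the data, no intermediate allocations, better locality.
    public static float ScoreFused(float[] input, float[] weights)
    {
        float score = 0;
        for (int i = 0; i < input.Length; i++)
            score += weights[i] * (0.5f * input[i] * input[i]);
        return score;
    }
}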
• 8. Related work
• Optimisations for single operators, like DNNs [4-5]
• TensorFlow Serving [6]: pipelines as Servable Python objects
• ML.NET: models as zip files with state files and DLLs
• Clipper [7] and Rafiki [8] deploy pipelines as Docker containers
  – They schedule requests based on a latency target
  – Can apply caching and batching
• MauveDB [9] accepts regression and interpolation models and optimises them as DB views
[4] https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/
[5] In-Datacenter Performance Analysis of a Tensor Processing Unit. arXiv, Apr. 2017
[6] C. Olston, et al. TensorFlow-Serving: Flexible, High-Performance ML Serving. In Workshop on ML Systems at NIPS, 2017
[7] D. Crankshaw, et al. Clipper: A Low-Latency Online Prediction Serving System. In NSDI, 2017
[8] W. Wang, S. Wang, J. Gao, et al. Rafiki: Machine Learning as an Analytics Service System. ArXiv e-prints, Apr. 2018
[9] A. Deshpande and S. Madden. MauveDB: Supporting Model-based User Views in Database Systems. In SIGMOD, 2006
• 10. Design principles
White Box Prediction Serving: make pipelines co-exist better, schedule better
1. End-to-end Optimizations: merge operators into computational units, to decrease overheads
2. Multi-model Optimizations: create once, use everywhere, for both data and stages
• 11. Afterthought: scoring as querying
SELECT * FROM Model2 WHERE … JOIN WORD-NGRAM
(Diagram: the input "This is a good product" flows through a shared WORD-NGRAM operator whose state maps good:0, bad:1, …, much like a query joining against a shared table.)
• 12. Off-line phase - Flour+Oven
A Scheduler manages the dynamic decisions on how to schedule plans based on machine workload. Finally, a FrontEnd is used to submit prediction requests to the system. Model pipeline deployment and serving in PRETZEL follow a two-phase process. During the off-line phase (Section 4.1), ML2's pre-trained pipelines are translated into Flour's transformation-based API. The Oven optimizer re-arranges and fuses transformations into model plans composed of parameterized logical units called stages. Each logical stage is then AOT-compiled into physical computation units where memory resources and threads are pooled at runtime. Model plans are registered for prediction serving in the Runtime, where physical stages and parameters are shared between pipelines with similar model plans. In the on-line phase (Section 4.2), when an inference request for a registered model plan is received, physical stages are parameterized dynamically with the proper values maintained in the Object Store. The Scheduler is in charge of binding physical stages to shared execution units. Figures 6 and 7 pictorially summarize the above descriptions; note that only the on-line phase is executed at inference time, whereas the model plans are generated completely off-line. Next, we describe each layer composing the PRETZEL prediction system.
Flour API
4.1 Off-line Phase
4.1.1 Flour
The goal of Flour is to provide an intermediate representation between ML frameworks (currently only ML2) and PRETZEL that is both easy to target and amenable to optimizations. Once a pipeline is ported into Flour, it can be optimized and compiled (Section 4.1.2) into a model plan before being fed into the PRETZEL Runtime for on-line scoring. Flour is a language-integrated API similar to KeystoneML [45], RDDs [53] or LINQ [35], where sequences of transformations are chained into DAGs and lazily compiled for execution. Listing 1 shows how the sentiment analysis pipeline of Figure 1 can be expressed in Flour. Flour programs are composed of transformations, where a one-to-many mapping exists between ML2 operators and Flour transformations (i.e., one operator in ML2 can be mapped to many transformations in Flour). Each Flour program starts from a FlourContext object wrapping the Object Store. Subsequent method calls define a DAG of transformations, which ends with a call to plan to instantiate the model plan before feeding it into the PRETZEL Runtime. For example, in line 3 of Listing 1 the CSV.FromText call is used to specify that the target DAG accepts as input text in csv format, where the fields and fieldsType arrays indicate the number and type of input fields. The successive call to Tokenize in line 4 splits the input fields into tokens. Lines 6 and 7 contain the two branches defining the char-level and word-level n-gram transformations, which are then merged with the Concat transform in line 9 before the linear binary classifier of line 10. Both char and word n-gram transformations are parametrized by the number of n-grams and by maps translating n-grams into numerical format (not shown in the listing). Additionally, each Flour transformation accepts as input an optional set of statistics gathered from training. These statistics are used by the compiler to generate physical plans more efficiently tailored to the model characteristics. Example statistics are max vector size (to define the minimum size of vectors to fetch from the pool at prediction time, Section 4.2), dense or sparse representations, etc.
Listing 1: Flour program for the sentiment analysis pipeline. Transformations' parameters are extracted from the original ML2 pipeline.
1  var fContext = new FlourContext(objectStore, ...)
2  var tTokenizer = fContext.CSV.
3      FromText(fields, fieldsType, sep).
4      Tokenize();
5
6  var tCNgram = tTokenizer.CharNgram(numCNgrms, ...);
7  var tWNgram = tTokenizer.WordNgram(numWNgrms, ...);
8  var fPrgrm = tCNgram.
9      Concat(tWNgram).
10     ClassifierBinaryLinear(cParams);
11
12 return fPrgrm.Plan();
We have instrumented the ML2 library to collect statistics from training, and added the related bindings to the Object Store and Flour to automatically extract Flour programs from pipelines once trained.
Oven Optimiser
4.1.2 Oven
With Oven, our goal is to bring query compilation and optimization techniques into ML2.
Optimizer: When plan is called on a Flour transformation's reference (e.g., fPrgrm in line 12 of Listing 1), all transformations leading to it are wrapped and analyzed. Oven follows the typical rule-based database optimizer design where operator graphs (query plans) are transformed by a set of rules until a fixed point is reached (i.e., the graph does not change after the application of any rule). The goal of the Oven Optimizer is to transform an input graph of Flour transformations into a stage graph, where each stage contains one or more transformations. To group transformations into stages we used Tupleware's hybrid approach [26]: memory-intensive transformations (such as most featurizers) are pipelined together in a single pass over the data. This strategy achieves the best data locality because records are likely to reside in CPU registers [33, 38]. Compute-intensive transformations (e.g., vector or matrix multiplications) are executed one-at-a-time so that Single Instruction, Multiple Data (SIMD) vectorization can be exploited, therefore optimizing the number of instructions per record [56, 20]. Stages are generated by traversing the Flour transformation graph until a pipeline-breaking transformation is found: all transformations up to that point are then grouped into a single stage. Within each stage, compute-bound operations are vectorized. This strategy allows PRETZEL to considerably decrease the number of vectors (and as a consequence memory) with respect to the operator-at-a-time strategy of ML2. Dependencies between stages are created as aggregations of the dependencies between the internal transformations. Transformation classes are annotated (e.g., 1-to-1, 1-to-n, memory-bound, compute-bound, commutative and associative) to ease the optimization process: no dynamic compilation [26] is necessary, since the set of operators is fixed and manual annotation is sufficient to generate properly optimized plans. (Note that ML2 does provide a second-order operator accepting arbitrary code requiring dynamic compilation; however, this is not supported in our current version of PRETZEL.) In the example sentiment analysis pipeline of Figure 1, Oven is able to recognize that the Linear Regression can be pushed into Char and WordNgram, therefore bypassing the execution of Concat. Additionally, Tokenizer can be reused between Char and WordNgram, therefore it will be pipelined with CharNgram (in one stage) and a dependency between CharNgram and WordNgram (in another stage) will be created. The final plan will therefore be composed of 2 stages, versus the initial 4 operators (and vectors) of ML2.
Figure 6: Model optimization and compilation in PRETZEL. In (1), a model is translated into a Flour program. (2) Oven Optimizer generates a DAG of logical stages from the program. Additionally, parameters and statistics are extracted. (3) A DAG of physical stages is generated by the Oven Compiler using logical stages, parameters, and statistics. A model plan is the union of all the elements and is fed to the runtime.
Model Plan Compiler: Model plans are composed of two DAGs: a DAG of logical stages and a DAG of physical stages. Logical stages are an abstraction of the stages output by the Oven Optimizer, with related parameters; physical stages contain the actual code that will be executed by the PRETZEL runtime. For each given DAG, there is a 1-to-n mapping between logical and physical stages, so that a logical stage can represent the execution code of different physical implementations. A physical implementation is selected based on the parameters characterizing a logical stage and the available statistics. Plan compilation is a two-step process. After the stage DAG is generated by the Oven Optimizer, the Model Plan Compiler (MPC) maps each stage into its logical representation containing all the parameters for the transformations composing the original stage generated by the optimizer. Parameters are saved for reuse in the Object Store (Section 4.1.3). Once the logical plan is generated, MPC traverses the DAG in topological order and maps each logical stage into a physical implementation. Physical implementations are AOT-compiled, parameterized, lock-free computation units. Each physical stage can be seen as a parametric function which will be dynamically fed at runtime with the proper data vectors and pipeline-specific parameters. This design allows the PRETZEL runtime to share the same physical implementation between multiple pipelines, and no memory allocation occurs on the prediction path (more details in Section 4.2.1). Logical plans maintain the mapping between the pipeline-specific parameters saved in the Object Store and the physical stages executing on the runtime. Figure 6 summarizes the process of generating model plans out of ML2 pipelines.
4.1.3 Object Store
The motivation behind the Object Store is based on the insights of Figure 3: since many DAGs have similar structures, sharing operators' state (parameters) can considerably improve memory footprint, and consequently the number of predictions served per machine. An example is language dictionaries used for input text featurization, which are often in common among many models and are relatively large. The Object Store is populated off-line by …
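To make the stage/parameter split concrete, here is a minimal C# sketch with hypothetical names (this is not Pretzel's actual code): the physical stage is a single compiled unit shared by every pipeline, while each pipeline only contributes a small parameter record whose large state (the vocabulary) lives once in the Object Store.

using System;
using System.Collections.Generic;

// Hypothetical parameter record for an n-gram stage: the heavy per-model state
// (the n-gram dictionary) is referenced, not copied, by each pipeline.
public sealed class NgramParams
{
    public int N;
    public IReadOnlyDictionary<string, int> Vocabulary;
}

// One AOT-compiled, stateless physical stage shared by every pipeline that needs it.
public static class NgramStage
{
    public static float[] Run(string[] tokens, NgramParams p)
    {
        var features = new float[p.Vocabulary.Count];
        for (int i = 0; i + p.N <= tokens.Length; i++)
        {
            string gram = string.Join(" ", tokens, i, p.N);
            if (p.Vocabulary.TryGetValue(gram, out int idx))
                features[idx] += 1f;
        }
        return features;
    }
}

// Two pipelines, one shared implementation and one shared dictionary:
// only the small parameter records differ.
public static class Demo
{
    public static void Main()
    {
        var sharedVocab = new Dictionary<string, int> { ["good product"] = 0, ["bad product"] = 1 };
        var modelA = new NgramParams { N = 2, Vocabulary = sharedVocab };
        var modelB = new NgramParams { N = 2, Vocabulary = sharedVocab }; // same state, stored once

        var tokens = "this is a good product".Split(' ');
        Console.WriteLine(NgramStage.Run(tokens, modelA)[0]);
        Console.WriteLine(NgramStage.Run(tokens, modelB)[0]);
    }
}

Two registered models with the same n-gram dictionary then reference the same vocabulary object, which is the sharing effect measured later in the memory experiment.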
• 13. On-line phase
• Two main components:
  – Runtime, with an Object Store
  – Scheduler
• Runtime handles physical resources: threads and buffers
• Object Store caches objects of all models – models register and retrieve state objects via a key (sketched below)
• Scheduler is event-based, each stage being an event
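A minimal sketch of the two on-line components, with hypothetical names and simplified to a few lines (the real runtime pools buffers and pins threads): the Object Store hands out shared, immutable state objects by key, and the scheduler treats each runnable stage as an event consumed by a pool of workers.

using System;
using System.Collections.Concurrent;

// Hypothetical Object Store: immutable state objects (dictionaries, weight vectors)
// are registered under a key and shared by every model plan that asks for the same key.
public sealed class ObjectStore
{
    private readonly ConcurrentDictionary<string, object> _objects =
        new ConcurrentDictionary<string, object>();

    public T GetOrRegister<T>(string key, Func<T> factory) where T : class
        => (T)_objects.GetOrAdd(key, _ => factory());
}

// Hypothetical event-based scheduler: each runnable stage becomes an event in a queue,
// and worker threads execute events as cores become available.
public sealed class StageScheduler
{
    private readonly BlockingCollection<Action> _events = new BlockingCollection<Action>();

    public void Submit(Action stage) => _events.Add(stage);

    public void RunWorker()
    {
        foreach (var stage in _events.GetConsumingEnumerable())
            stage(); // completing a stage can Submit() its downstream stages
    }
}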
• 15. Workload and testbed
• Two model classes written in ML.NET, running in ML.NET and Pretzel
  – 250 Sentiment-Analysis models
  – 50 Regression Task models
• Testbed representing a small production server
  – 2 × 8-core Xeon E5-2620 v4 at 2.10 GHz, HT enabled
  – 32 GB RAM
  – Windows 10
• 16. Memory
• 7x less memory for SA models, 40% for RT models
• Higher model density, higher efficiency and profitability
Figure 8: Cumulative memory usage of the model pipelines with and without Object Store, normalized by the maximum usage with Object Store for SA models. By sharing common objects, we can decrease memory usage by up to 7× in SA and around 40% for RT models.
Keeping track of plan parameters can also help reduce the time for loading models: as expected, PRETZEL can load 300 models 7.4 times faster compared to ML2.
• 17. Latency
• Hot vs Cold scenarios
• Without and with Caching of partial results
• ML.NET cold: 10-300x w.r.t. hot
• Pretzel cold: 3-20x slower than hot
• Pretzel vs ML.NET: cold is 15x faster, hot is 2.6x
• For single-model runs, average improvement is 2.5x
• More gain with multiple requests
5.2 Latency
In this experiment we study the latency behavior of PRETZEL and compare it with ML2. We also study hot and cold scenarios. Specifically, we label as cold the first prediction requested for a model; the successive 10 predictions are then discarded and we report hot numbers as the average of the following 100 predictions. Inference requests are submitted sequentially and in isolation for one model at a time. For PRETZEL we use the request-response engine. The comparison between PRETZEL and ML2 is reported in Figure 9. As we can see, in ML2 the 99th percentile latency in cold (denoted by P99cold) is more than an order of magnitude slower than P99hot, with a peak of almost a 300× slowdown in the worst case. Conversely, in PRETZEL P99cold is within 3× of P99hot, and 20× in the worst case (an improvement of more than 10× on tail latency). If instead we compare PRETZEL directly with ML2, PRETZEL is 2.6× faster than ML2 in the P99hot case, and about 15× in the P99cold case. Peak performance gains for PRETZEL for SA pipelines are above 50×, and 100× in the RT category.
Figure 9: Latency comparison between ML2 and PRETZEL, with values normalized over ML2's P99hot latency. Cliffs indicate pipelines with big differences in latency.
Sub-plan Materialization: Figure 10 depicts the effect of sub-plan materialization on prediction latency. As we can notice, only part of the pipelines benefit from materializing partial results, while no pipeline shows performance deterioration (except the first cold scoring of the first model, which shows a 6× slowdown).
Figure 10: Latency of PRETZEL to run SA models with and without sub-plan materialization, normalized by the P99hot latency of the value obtained without the optimization. Around 85% of SA pipelines show a 2.5× speedup. Sub-plan materialization does not apply to RT pipelines.
With result caching enabled, latency decreases even further when the same inferences are requested for the same model. In this case the latency is proportional to a dictionary lookup (see the sketch below).
5.3 Throughput
In this experiment, we assume a batch scenario where all 300 models are scored several times using a fixed batch size of 1000. We use from 2 up to 29 cores to serve requests, while 3 cores are reserved to generate them. The main goal is to measure the maximum number of requests PRETZEL and ML2 can serve per second. Materialization and caching are not enabled for this experiment. Figure 11 shows that PRETZEL's throughput is up to 2.5× higher than ML2's (at 16-core scale) and scales on par with the expected maximum scaling, whereas ML2 suffers a decrease in scalability when the number of CPU cores increases. This is because each thread has its own internal copy of the models, whereby cache lines are not shared, thus increasing the pressure on the memory subsystem: indeed, even if the data values are the same, the model objects are mapped to different memory areas.
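The "latency proportional to a dictionary lookup" behaviour can be pictured with a minimal sketch (hypothetical class, not Pretzel's actual API): a concurrent dictionary keyed by model and input short-circuits repeated identical requests; sub-plan materialization applies the same idea to the output of a shared prefix of stages.

using System;
using System.Collections.Concurrent;

// Hypothetical result cache: repeated identical requests to the same model
// skip the pipeline entirely and cost one dictionary lookup.
public sealed class PredictionCache
{
    private readonly ConcurrentDictionary<string, float> _cache =
        new ConcurrentDictionary<string, float>();

    public float Score(string modelId, string input, Func<string, float> runPipeline)
    {
        // The key combines model and input; the factory runs the full plan only on a miss.
        return _cache.GetOrAdd(modelId + "\u0001" + input, _ => runPipeline(input));
    }
}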
  • 18. • Batch size is 1000 inputs, multiple runs of same batch • Delay batching: as in Clipper, batch requests and accept higher latency 18 Throughput Figure 11: The average throughput computed among the 300 models to process one million inputs each. We scale the number of cores on the x-axis. PRETZEL scales lin- early to the number of CPU cores, close to the expected maximum throughput with Hyper Threading enabled. not shared, thus increasing the pressure on the memory subsystem: indeed, even if the data values are the same, the model objects are mapped to different memory areas. Delay Batching: As in Clipper, PRETZEL FrontEnd al- lows users to specify a maximum delay they can wait to maximize throughput. For each model category, Figure 12 depicts the trade-off between throughput and latency as the delay increases. Interestingly, for such models even a small delay helps reaching the optimal batch size. Figure 12: Throughput and latency of SA and RT models latency, instead, gracefully increases linearly with the increase of the load. Figure 13: Throughput and latency of PRETZEL under heavy load scenario. Reservation Scheduling: If we run the previous exper- iment using reservation scheduling for one model, such model does not encounter any degradation in latency (max improvement of 3 orders-of-magnitude) as the load in- creases, while the total throughput increases about 3%. In the experiment, 28 cores are used for the shared load, while one core (and related vectors) is reserved exclu- sively for one model. 6 Limitations and Future Work Off-line Phase: PRETZEL suffers two limitations regard- ing Flour and Oven design. First, PRETZEL currently has several logical and physical stages classes, one per possi- ble implementation, which make the system difficult to maintain in the long run. Additionally, different back-ends (e.g., PRETZEL currently supports operators implemented in C# and C++, and experimentally on FPGA [X]) require all specific operator implementations. We are however confident that this limitation will be overcome once code 300 models to process one million inputs each. We scale the number of cores on the x-axis. PRETZEL scales lin- early to the number of CPU cores, close to the expected maximum throughput with Hyper Threading enabled. not shared, thus increasing the pressure on the memory subsystem: indeed, even if the data values are the same, the model objects are mapped to different memory areas. Delay Batching: As in Clipper, PRETZEL FrontEnd al- lows users to specify a maximum delay they can wait to maximize throughput. For each model category, Figure 12 depicts the trade-off between throughput and latency as the delay increases. Interestingly, for such models even a small delay helps reaching the optimal batch size. Figure 12: Throughput and latency of SA and RT models with delayed batching: while throughput increases with the delay, also record latency increase. RT latency is more tolerant to larger delays compared to SA. 5.4 Heavy Load In this last experiment we load all 300 models in one in- stance and we submit requests to all models concurrently following a Zipf distribution. As for the throughput exper- iment we use 3 threads to generate load and 29 to serve requests. Figure 13 reports how throughput and latency vary as we increase the load. As expected, PRETZEL’s throughput increases initially and stabilizes. PRETZEL’s Figure 13: Throughput and latency of PRET heavy load scenario. 
W/o Delay batching
• ML.NET suffers from missed data sharing, i.e. higher memory traffic
W/ Delay batching
• Not supported in ML.NET
• Quickly reaches the optimal batch size (sketched below)
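As a rough illustration of delay batching, the following sketch buffers requests until either the batch is full or a user-specified maximum delay expires. The class and its members are assumptions for illustration, not the actual FrontEnd API.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Minimal sketch of Clipper-style delay batching (illustrative, not PRETZEL's FrontEnd).
// Requests are buffered until the batch is full or the maximum delay expires.
public class DelayBatcher<TInput>
{
    private readonly BlockingCollection<TInput> _queue = new();
    private readonly int _maxBatchSize;
    private readonly TimeSpan _maxDelay;

    public DelayBatcher(int maxBatchSize, TimeSpan maxDelay)
    {
        _maxBatchSize = maxBatchSize;
        _maxDelay = maxDelay;
    }

    public void Submit(TInput input) => _queue.Add(input);

    // Collects up to _maxBatchSize inputs, waiting at most _maxDelay for stragglers.
    public List<TInput> NextBatch()
    {
        var batch = new List<TInput> { _queue.Take() };  // block until at least one request arrives
        var deadline = DateTime.UtcNow + _maxDelay;      // then wait at most _maxDelay for more
        while (batch.Count < _maxBatchSize)
        {
            var remaining = deadline - DateTime.UtcNow;
            if (remaining <= TimeSpan.Zero || !_queue.TryTake(out var item, remaining)) break;
            batch.Add(item);
        }
        return batch;
    }
}
```

A larger delay trades latency for throughput, which is the trade-off shown in Figure 12.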
• 20.
• We addressed performance/density bottlenecks in ML inference for MaaS
• We advocate the adoption of a white-box approach
• We are re-thinking ML inference as a DB query problem
  - buffer/locality issues
  - code generation
  - runtime, hardware-aware scheduling
20 Conclusions
• 21.
• Oven has pre-written implementations for logical and physical stages: not maintainable in the long run
• We are adding code generation of stages, maybe with hardware-specific templates [10]
• Tune code generation for more aggressive optimizations
• Support more model formats than ML.NET, like ONNX [11]
21 Off-line phase
[10] K. Krikellas, S. Viglas, et al. Generating code for holistic query evaluation. In ICDE, pages 613–624. IEEE Computer Society, 2010.
[11] Open Neural Network Exchange (ONNX). https://onnx.ai, 2017.
• 22.
• Distributed version
• Main work is on NUMA awareness, to improve locality and scheduling
  – Schedule on the NUMA node where most shared state resides (see the sketch after this slide)
  – Duplicate hot, shared state objects on nodes for lower latency
  – Scale better, avoiding bottlenecks on the coherence bus
22 On-line phase
Pretzel is currently under submission at OSDI '18
QUESTIONS ?
Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it
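As a back-of-the-envelope illustration of the NUMA-aware scheduling idea on the on-line phase slide above, the following sketch picks the node that already holds the largest share of a pipeline's shared state. All names and data structures here are assumptions, not PRETZEL code.

```csharp
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch of NUMA-aware placement (assumed names, not PRETZEL's scheduler).
// Idea: run a pipeline on the NUMA node where most of its shared state already resides.
public class NumaAwareScheduler
{
    // For each shared object (e.g., a dictionary or weight vector), the node it lives on.
    private readonly Dictionary<string, int> _objectToNode;
    private readonly int _numNodes;

    public NumaAwareScheduler(Dictionary<string, int> objectToNode, int numNodes)
    {
        _objectToNode = objectToNode;
        _numNodes = numNodes;
    }

    // Pick the node holding the largest share (by size) of the pipeline's state.
    public int PickNode(IEnumerable<(string ObjectId, long SizeBytes)> pipelineState)
    {
        var bytesPerNode = new long[_numNodes];
        foreach (var (objectId, sizeBytes) in pipelineState)
            if (_objectToNode.TryGetValue(objectId, out var node))
                bytesPerNode[node] += sizeBytes;

        // Argmax over nodes: schedule where most shared state already lives.
        return Enumerable.Range(0, _numNodes)
                         .OrderByDescending(n => bytesPerNode[n])
                         .First();
    }
}
```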
• 24.
• Oven's Model Plan Compiler compiles logical stages into physical stages, based on parameters and statistics
24 Off-line phase - Oven
[9] A. Crotty, A. Galakatos, K. Dursun, et al. An architecture for compiling UDF-centric workflows. PVLDB, 8(12):1466–1477, Aug. 2015.
Figure 6: Model optimization and compilation in PRETZEL. (1) A model is translated into a Flour program ("var fContext = ...; var Tokenizer = ...; return fPrgm.Plan();", expanded in the sketch below). (2) The Oven Optimizer generates a DAG of logical stages from the program and extracts parameters and statistics. (3) The Oven Compiler generates a DAG of physical stages from the logical stages, parameters, and statistics; the model plan is the union of all these elements and is fed to the runtime.
In the sentiment-analysis pipeline of Figure 1, Oven recognizes that the linear regression can be pushed into the Char and Word n-gram featurizers, bypassing Concat, and that the Tokenizer can be reused between them: the final plan has 2 stages instead of the original 4 operators (and vectors) of ML2.
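Expanding the abbreviated snippet shown in Figure 6, a Flour program for the sentiment-analysis pipeline might look roughly as follows. The fluent method names and their arguments (FromText, Tokenize, CharNgram, WordNgram, ClassifierBinaryLinear, and the various parameter variables) are assumptions for illustration, not necessarily the exact Flour API.

```csharp
// Sketch of a Flour program (method body); identifiers below are illustrative assumptions.
var fContext = new FlourContext(objectStore);

var tTokenizer = fContext.CSV
    .FromText(fields, fieldTypes, separator)    // schema of the input record
    .Tokenize();                                 // split the sentence into tokens

var tCharNgram = tTokenizer.CharNgram(charNgramParams);
var tWordNgram = tTokenizer.WordNgram(wordNgramParams);

var fPrgm = tCharNgram
    .Concat(tWordNgram)                          // concatenate the two feature vectors
    .ClassifierBinaryLinear(linearParams);       // final linear model

return fPrgm.Plan();                             // hand the transformation DAG to Oven
```

Oven would then fuse this DAG into the two stages described above.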
Model Plan Compiler: a model plan is composed of two DAGs, one of logical stages and one of physical stages. Logical stages abstract the output of the Oven Optimizer together with their parameters; physical stages contain the actual code executed by the PRETZEL runtime, with a 1-to-n mapping from each logical stage to its possible physical implementations, selected using the stage's parameters and the available statistics. Physical stages are AOT-compiled, parameterized, lock-free computation units: at runtime each one is fed the proper data vectors and pipeline-specific parameters, so the same physical implementation is shared between multiple pipelines and no memory allocation occurs on the prediction path (a sketch follows the list below). Parameters are saved for reuse in the Object Store, which shares operators' state (e.g., large language dictionaries used for text featurization) across models with similar structure, reducing memory footprint and increasing the number of models served per machine.
• Oven optimises pipelines much like queries
• Tupleware's hybrid approach [9]
  – Memory-intensive operators are merged into a logical stage, to increase data locality
  – Compute-intensive operators stay alone, to leverage SIMD execution
  – One-to-one operators are chained together, while one-to-many can break locality
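To make the notion of a shared, parameterized physical stage concrete, here is a minimal sketch under assumed types (not PRETZEL's actual classes): the stage code is compiled once and shared by every pipeline, while each pipeline supplies its own parameters, fetched from the Object Store, at prediction time.

```csharp
using System.Collections.Generic;

// Illustrative sketch of a shared, parameterized physical stage (assumed names).
// One compiled implementation serves many pipelines; only the parameters differ.
public interface IPhysicalStage
{
    // Executes the stage over an input vector using pipeline-specific parameters.
    float[] Run(float[] input, IReadOnlyDictionary<string, object> parameters);
}

public sealed class LinearModelStage : IPhysicalStage
{
    public float[] Run(float[] input, IReadOnlyDictionary<string, object> parameters)
    {
        // Weights and bias are pipeline-specific state (conceptually from the Object Store),
        // so the same LinearModelStage instance can serve every linear-model pipeline.
        var weights = (float[])parameters["weights"];
        var bias = (float)parameters["bias"];

        float score = bias;
        for (int i = 0; i < weights.Length; i++)
            score += weights[i] * input[i];

        return new[] { score };
    }
}
```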
• 25.
• Load all 300 models upfront
• 3 threads to generate requests, 29 to score
• One core serves single requests, the others serve batches
• ML.NET does not support this scenario
• Latency degrades gracefully with load
25 Heavy load
Figure 13: Throughput and latency of PRETZEL under the heavy-load scenario. Throughput increases initially and then stabilizes; latency grows gracefully and linearly with the load.
Reservation scheduling: running the same experiment with reservation scheduling for one model, that model sees no latency degradation as the load increases (a maximum improvement of 3 orders of magnitude), while total throughput grows by about 3%. In this experiment, 28 cores serve the shared load, while one core (and the related vectors) is reserved exclusively for the one model (sketched below).
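A minimal sketch of the reservation-scheduling idea, with hypothetical names rather than PRETZEL's scheduler API: requests for one latency-critical model are drained by a dedicated worker (ideally pinned to its own core), while all other requests share the remaining workers.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Minimal sketch of reservation scheduling (illustrative, not PRETZEL's scheduler).
public class ReservationScheduler
{
    private readonly BlockingCollection<Action> _sharedQueue = new();
    private readonly BlockingCollection<Action> _reservedQueue = new();

    public int ReservedModelId { get; }

    public ReservationScheduler(int sharedWorkers, int reservedModelId)
    {
        ReservedModelId = reservedModelId;

        // Shared pool: e.g., 28 workers serving all other models.
        for (int i = 0; i < sharedWorkers; i++)
            new Thread(() => Drain(_sharedQueue)) { IsBackground = true }.Start();

        // One worker (conceptually pinned to a reserved core) for the critical model.
        new Thread(() => Drain(_reservedQueue)) { IsBackground = true }.Start();
    }

    public void Submit(int modelId, Action predict)
    {
        if (modelId == ReservedModelId) _reservedQueue.Add(predict);
        else _sharedQueue.Add(predict);
    }

    private static void Drain(BlockingCollection<Action> queue)
    {
        foreach (var work in queue.GetConsumingEnumerable())
            work();
    }
}
```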