Pretzel: optimized Machine Learning framework for low-latency and high throughput workloads

Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it
Oracle Labs, Belmont, 05/30/2018
Yunseong Lee, Alberto Scolari, Matteo Interlandi, Markus
Weimer, Marco D. Santambrogio, Byung-Gon Chun
PRETZEL: Opening the Black Box
of Machine Learning Prediction Serving Systems

ML-as-a-Service
• ML models are deployed on cloud platforms as black
boxes
• Users deploy multiple models per machine (10-100s)
2
• Deployed models are often similar
- Similar structure
- Similar state
• Two key requirements:
1. Performance: latency or throughput
2. Model density: number of models per
machine
good:0
bad:1
…
This is a good
product
good:0
bad:1
…
This is a good
product

We need to know structure and state:
white-box model
3
Breaking the black-box model
1. To generate an optimised version of a model on
deployment: higher performance
2. To allocate shared state only once, and share among
models: higher density

‣ Opening the black box
‣ Pretzel, white-box serving system
‣ Evaluation
‣ Conclusions and future work
4
Outline

6
Case study
• 300 models in ML.NET [1], C#, from Amazon Reviews dataset
[2]: 250 Sentiment Analysis + 50 Regression Task
• Analyzed and accelerated in previous work [3]
• Most of the time is spent in featurization
• No single bottleneck
PRETZEL: Opening the Black Box of Machine Learning
Prediction Serving Systems
Paper # 355
Abstract
Machine Learning models are often composed of
pipelines of transformations. While this design allows to
efficiently execute single model components at training-
time, prediction serving has different requirements such
as low latency, high throughput and graceful performance
degradation under heavy load. Current prediction serv-
ing systems consider models as black boxes, whereby
prediction-time-specific optimizations are ignored in fa-
vor of ease of deployment. In this paper, we present
PRETZEL, a prediction serving system introducing a
novel white box architecture enabling both end-to-end
and multi-model optimizations. Using production-like
model pipelines, our experiments show that PRETZEL is
able to introduce performance improvements over differ-
ent dimensions; compared to state-of-the-art approaches
PRETZEL is able to reduce latency by up to 100⇥ while
reducing memory footprint by up to 7⇥, and increasing
throughput by up to 2.5⇥.
1 Introduction
Many Machine Learning (ML) frameworks such as
Google TensorFlow [2], Facebook Caffe2 [4], Scikit-
learn [41], or CompanyX’s Internal ML Library (ML2)
allow data scientists to declaratively author pipelines of
transformations to train models from large-scale input
datasets. Model pipelines are internally represented as
Directed Acyclic Graphs (DAGs) of operators comprising
data transformations and featurizers (e.g., string tokeniza-
tion, hashing, etc.), and ML models (e.g., decision trees,
linear models, SVMs, etc.). Figure 1 shows an example
pipeline for text analysis whereby input sentences are
classified according to the expressed sentiment.
Machine Learning is usually conceptualized as a two-
steps process: first, during training model parameters are
estimated from large datasets by running computation-
ally intensive iterative algorithms; successively, trained
pipelines are used for inference to generate predictions
through the estimated model parameters. When trained
“This is a nice product”
Positive vs. Negative
Tokenizer
Char
Ngram
Word
Ngram
Concat
Linear
Regression
Figure 1: A sentimental analysis pipeline consisting of
operators for featurization (ellipses), followed by a ML
model (diamond). Tokenizer extracts tokens (e.g., words)
from the input string. Char and Word Ngrams featurize
input tokens by extracting n-grams. Concat generates a
unique feature vector which is then scored by a Linear
Regression predictor. This is a simplification: the actual
DAG contains about 12 operators.
pipelines are served for inference, the full set of operators
is deployed altogether.However, pipelines have different
system characteristics based on the phase in which they
are employed: for instance, at training time ML models
run complex algorithms to scale over large datasets (e.g.,
linear models can use gradient descent in one of its many
flavors [43, 42, 44]), while, once trained, they behave as
other regular featurizers and data transformations; further-
more, during inference pipelines are often surfaced for
direct users’ servicing and therefore require low latency,
high throughput, and graceful degradation of performance
in case of load spikes.
Existing prediction serving systems, such as Clip-
per [7, 25], TensorFlow Serving [3, 39], Rafiki [49], ML2
itself, and others [13, 14, 36, 12] focus mainly on ease
of deployment, where pipelines are considered as black
boxes and deployed into containers (e.g., Docker [9] in
Clipper and Rafiki, servables in TensorFlow Serving).
Under this strategy, only “pipeline-agnostic” optimiza-
tions such as caching, batching and buffering are avail-
able. Nevertheless, we found that black box approaches
fell short on several aspects. For instance, prediction ser-
1
Fig. 2. Execution breakdown of the example model
2 THE CASE FOR ACCELERATING GENERIC PRE-
DICTION PIPELINES
Nowadays, several “intelligent” services such as Microsoft
Cortana speech recognition, Netflix movie recommender or
Gmail spam detector depend on ML model scoring capabil-
ities, and are currently experiencing a growing demand that
in turn fosters the research on their acceleration.
While efforts exist that try to accelerate specific types of
models (e.g., Brainwave for deep neural networks) many
models we actually see in production are often generic
and composed of several (tens of) different transformations.
Many workloads run in a prediction-serving, cloud-like
setting, where prediction models coming from data science
experts are operationalized. While data scientists prefer to
use high-level declarative tools such as IMLT for better
productivity, operationalized models require low latency,
high throughput, and highly predictable performance. By
separating the platform where models are authored (and
trained) from the one where models are operationalized, we
can make simplifying assumptions on the latter:
• models are pre-trained and do not change during
execution; 1
• models from the same user likely share several
transformations, or, in case such transformations are
stateful (and immutable), they likely share part of
the state; in some cases (for example a tokenization
process on a common English vocabulary) the state
sharing may hold even beyond single users
where the prediction (the LinReg transformation
very small amount of time compared to other com
(0.3% of the execution time). Instead, the n-gram
mations take up almost 60% of the execution tim
another 33% is spent in concatenating the feature
which consists in allocating memory and copying
summary, most of the execution time is lost in p
the data for scoring, not in computing the predictio
workloads, we observed that this situation is co
many models, because many tasks employ simpl
tion models like linear regression functions, both b
their prediction latency, and because of their simp
understandability. In these scenarios, we can conc
the acceleration of an ML pipeline cannot focus
specific transformations, as in the case of neural
where the bulk of computation is spent in scoring, b
a more systematic approach.
3 CONSIDERATIONS AND SYSTEM IMPLE
TION
Following the observations of Section 2, we ar
acceleration of generic ML prediction pipelines
ble if we optimize models end-to-end instead
transformations, i.e., while data scientist focus
level “logical” transformations, the operationalizat
model for scoring should focus on execution unit
optimal decisions on how data is processed throug
pipelines. We call such execution units stages, borro
term from the database and system communities
stage is a sequence of one or more transformation
executed together on a device (either a CPU or a
in this work), and represents the “atomic” unit of
putation. No data movement from the computing
another occurs within a stage, while communica
occur between two stages running on different dev
the PCIe communication when offloading to FPGA
[2] R. He and J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of
the 25th International Conference on World Wide Web, WWW ’16
[3] A. Scolari, Y. Lee, M. Weimer and M. Interlandi, "Towards Accelerating Generic Machine Learning Prediction Pipelines," 2017 IEEE International
Conference on Computer Design (ICCD)

• Overheads are a remarkable % of runtime
- JIT, GC, virtual calls, …
• Operators have different buffers and break
locality
• Operators are executed one at a time: no code
fusion, multiple data accesses
7
Limitations of black-box(a) We identify the operators by their parameters. The first two groups represent N-gram
operators, which have multiple versions with different length (e.g., unigram, trigram)
(b) 3 different ML models are use
each with distinct parameters.
Figure 3: Probability for an operator within the 250 different pipelines. If an operator x has the probability y%, the
250⇥ y
100 pipelines have operator x, implying that pipelines can save memory by sharing one instance of operator x
Figure 4: CDF of latency of prediction requests of 250
DAGs. We denote the first prediction as cold; the hot
line is reported as average over 100 predictions after a
warm-up period of 10. The plot is normalized over the
99th percentile latency of the hot case.
describes this situation, where the performance of hot
predictions over the 250 sentiment analysis pipelines with
memory already allocated and JIT-compiled code is more
than an order-of-magnitude faster then the cold version
for the same pipelines. To drill down more into the prob-
lem, we found that 57.4% of the total execution time for a
single cold prediction is spent in pipeline analysis and ini-
tialization of the function chain, 36.5% in JIT compilation
and the remaining is actual computation time.
Infrequent Accesses: In order to meet milliseconds-level
latencies [51], model pipelines have to reside in main
memory (possibly already warmed-up), since they can
have MBs to GBs (compressed) size on disk, with loading
and initialization times easily exceeding several seconds.
A common practice in production settings is to unload
a pipeline if not accessed after a certain period of time
(e.g., a few hours). Once evicted, successive accesses will
incur a model loading penalty and warming-up, therefore
violating Service Level Agreement (SLA).
Operator-at-a-time Model: As previously described,
predictions over ML2 pipelines are computed by pullin
records through a sequence of operators, each of the
transforming to the input vector(s) and producing on
or more new vectors. While (as is common practice f
in-memory data-intensive systems [38, 48, 18]) some i
terpretation overheads are eliminated via JIT compilatio
operators in ML2 (and in other tools) are “logical” entiti
(e.g., linear regression, tokenizer, one-hot encoder, etc
with diverse performance characteristics. Figure 5 show
the latency breakdown of one execution of the sentime
analysis pipeline of Figure 1, where the only ML oper
tor (linear regression) takes two orders-of-magnitude le
time with respect to the slowest operator (WordNgram
It is common practice for in-memory data-intensive sy
tems to pipeline operators in order to minimize memo
accesses for memory-intensive workloads, and to ve
torize compute intensive operators in order to minimiz
the number of instructions per data item [26, 56]. ML
operator-at-a-time model [56] (as other libraries missin
an optimization layer, such as Scikit-learn) is therefo
sub-optimal in that computation is organized around lo
ical operators, ignoring how those operators behave t
gether: in the example of the sentiment analysis pipelin
at hand, linear regression is commutative and associativ
(e.g., dot product between vectors) and can be pipeline
with Char and WordNgram, eliminating the need for th
Concat operation. As we will see in following Section
PRETZEL optimizer is able to detect this situation and
generate an execution plan that is several times faster wi
regard to ML2 unoptimized version of the pipeline.
Coarse Grained Scheduling: Scheduling CPU r
sources carefully is essential to serve highly concurre
requests and run machines to maximum utilization. Und
the black box approach: (1) a thread pool is used to serv
multiple concurrent requests to the same model pipelin
(2) for each request, one thread handles the execution
4
(a) We identify the operators by their parameters. The first two groups represent N-gram
operators, which have multiple versions with different length (e.g., unigram, trigram)
(b) 3 different ML models are used,
each with distinct parameters.
Figure 3: Probability for an operator within the 250 different pipelines. If an operator x has the probability y%, then
250⇥ y
100 pipelines have operator x, implying that pipelines can save memory by sharing one instance of operator x.
Figure 4: CDF of latency of prediction requests of 250
DAGs. We denote the first prediction as cold; the hot
line is reported as average over 100 predictions after a
warm-up period of 10. The plot is normalized over the
99th percentile latency of the hot case.
describes this situation, where the performance of hot
predictions over the 250 sentiment analysis pipelines with
memory already allocated and JIT-compiled code is more
than an order-of-magnitude faster then the cold version
for the same pipelines. To drill down more into the prob-
lem, we found that 57.4% of the total execution time for a
single cold prediction is spent in pipeline analysis and ini-
tialization of the function chain, 36.5% in JIT compilation
and the remaining is actual computation time.
Infrequent Accesses: In order to meet milliseconds-level
latencies [51], model pipelines have to reside in main
memory (possibly already warmed-up), since they can
have MBs to GBs (compressed) size on disk, with loading
and initialization times easily exceeding several seconds.
A common practice in production settings is to unload
a pipeline if not accessed after a certain period of time
(e.g., a few hours). Once evicted, successive accesses will
incur a model loading penalty and warming-up, therefore
violating Service Level Agreement (SLA).
Operator-at-a-time Model: As previously described,
predictions over ML2 pipelines are computed by pulling
records through a sequence of operators, each of them
transforming to the input vector(s) and producing one
or more new vectors. While (as is common practice for
in-memory data-intensive systems [38, 48, 18]) some in-
terpretation overheads are eliminated via JIT compilation,
operators in ML2 (and in other tools) are “logical” entities
(e.g., linear regression, tokenizer, one-hot encoder, etc.)
with diverse performance characteristics. Figure 5 shows
the latency breakdown of one execution of the sentiment
analysis pipeline of Figure 1, where the only ML opera-
tor (linear regression) takes two orders-of-magnitude less
time with respect to the slowest operator (WordNgram).
It is common practice for in-memory data-intensive sys-
tems to pipeline operators in order to minimize memory
accesses for memory-intensive workloads, and to vec-
torize compute intensive operators in order to minimize
the number of instructions per data item [26, 56]. ML2
operator-at-a-time model [56] (as other libraries missing
an optimization layer, such as Scikit-learn) is therefore
sub-optimal in that computation is organized around log-
ical operators, ignoring how those operators behave to-
gether: in the example of the sentiment analysis pipeline
at hand, linear regression is commutative and associative
(e.g., dot product between vectors) and can be pipelined
with Char and WordNgram, eliminating the need for the
Concat operation. As we will see in following Sections,
PRETZEL optimizer is able to detect this situation and to
generate an execution plan that is several times faster with
regard to ML2 unoptimized version of the pipeline.
Coarse Grained Scheduling: Scheduling CPU re-
sources carefully is essential to serve highly concurrent
requests and run machines to maximum utilization. Under
the black box approach: (1) a thread pool is used to serve
multiple concurrent requests to the same model pipeline;
(2) for each request, one thread handles the execution of
• No state sharing: state is
duplicated -> memory is wasted
Limited performance
Limited density

• Optimisations for single operators, like DNNs [4-5]
• TensorFlow Serving [6] as Servable Python objects
• ML.NET as zip files with state files and DLLs
• Clipper [7] and Rafiki [8] deploy pipelines as Docker containers
– They schedule requests based on latency target
– Can apply caching and batching
• MauveDB [9] accepts regression and interpolation model and
optimises them as DB views
8
Related work
[4] https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/
[5] In-Datacenter Performance Analysis of a Tensor Processing Unit, arXiv, Apr 2017
[6] C.Olston, et al., and V. Rajashekhar. Tensorflow - serving: Flexible, high-performance ml serving. In Work-shop on ML Systems at NIPS, 2017
[7] D. Crankshaw, et al.. Clipper: A low-latency online prediction serving system. In NSDI, 2017
[8] W.Wang,S.Wang,J.Gao, et al.. Rafiki: Machine Learning as an Analytics Service System. ArXiv e-prints, Apr. 2018
[9] A. Deshpande and S. Madden. Mauvedb: Supporting model-based user views in database systems. In SIGMOD, 2006

PRETZEL, WHITE-BOX SERVING SYSTEM
9

White Box Prediction Serving: make pipelines co-exist
better, schedule better
1. End-to-end Optimizations: merge operators in
computational units, to decrease overheads
2. Multi-model Optimizations: create once, use
everywhere, for both data and stages
10
Design principles

SELECT * FROM Model2 WHERE …
JOIN WORD-NGRAM
11
Afterthought: scoring as querying
good:0
bad:1
…
This is a good
product
WORD-NGRAM
“This is a good product”

12
Off-line phase - Flour+Ovenages the dynamic decisions on how to schedule plans
based on machine workload. Finally, a FrontEnd is used
to submit prediction requests to the system.
Model pipeline deployment and serving in PRETZEL
follow a two-phase process. During the off-line phase
(Section 4.1), ML2’s pre-trained pipelines are translated
into Flour transformation-based API. Oven optimizer re-
arranges and fuses transformations into model plans com-
posed of parameterized logical units called stages. Each
logical stage is then AOT compiled into physical computa-
tion units where memory resources and threads are pooled
at runtime. Model plans are registered for prediction serv-
ing in the Runtime where physical stages and parameters
are shared between pipelines with similar model plans. In
the on-line phase (Section 4.2), when an inference request
for a registered model plan is received, physical stages are
parameterized dynamically with the proper values main-
tained in the Object Store. The Scheduler is in charge of
binding physical stages to shared execution units.
Figures 6 and 7 pictorially summarize the above de-
scriptions; note that only the on-line phase is executed
at inference time, whereas the model plans are generated
completely off-line. Next, we will describe each layer
composing the PRETZEL prediction system.
4.1 Off-line Phase
4.1.1 Flour
The goal of Flour is to provide an intermediate represen-
tation between ML frameworks (currently only ML2) and
PRETZEL that is both easy to target and amenable to op-
timizations. Once a pipeline is ported into Flour, it can
be optimized and compiled (Section 4.1.2) into a model
plan before getting fed into PRETZEL Runtime for on-
line scoring. Flour is a language-integrated API similar
to KeystoneML [45], RDDs [53] or LINQ [35] where
sequences of transformations are chained into DAGs and
lazily compiled for execution.
Listing 1 shows how the sentiment analysis pipeline of
Figure 1 can be expressed in Flour. Flour programs are
composed by transformations where a one-to-many map-
ping exists between ML2 operators and Flour transforma-
tions (i.e., one operator in ML2 can be mapped to many
transformations in Flour). Each Flour program starts from
a FlourContext object wrapping the Object Store.
Subsequent method calls define a DAG of transforma-
tions, which will end with a call to plan to instantiate
the model plan before feeding it into PRETZEL Runtime.
For example, in line 3 of Figure 1 the CSV.FromText
call is used to specify that the target DAG accepts as in-
put text in csv format where fields and fieldsType
arrays indicate the number and type of input fields. The
successive call to Tokenize in line 4 splits the input
fields into tokens. Lines 6 and 7 contain the two branches
defining the char-level and word-level n-gram transforma-
tions, which are then merged with the Concat transform
in line 9 before the linear binary classifier of line 10. Both
char and word n-gram transformations are parametrized
by the number of n-grams and maps translating n-grams
into numerical format (not shown in the Figure). Addi-
tionally, each Flour transformation accepts as input an
optional set of statistics gathered from training. These
statistics are used by the compiler to generate physical
plans more efficiently tailored to the model characteristics.
Example statistics are max vector size (to define the mini-
mum size of vectors to fetch from the pool at prediction
time 4.2), dense or sparse representations, etc.
Listing 1: Flour program for the sentiment analysis
pipeline. Transformations’ parameters are extracted from
the original ML2 pipeline.
1 var fContext = new FlourContext(objectStore, ...)
2 var tTokenizer = fContext.CSV.
3 FromText(fields, fieldsType,
sep).
4 Tokenize();
5
6 var tCNgram = tTokenizer.CharNgram(numCNgrms,
...);
7 var tWNgram = tTokenizer.WordNgram(numWNgrms,
...);
8 var fPrgrm = tCNgram.
9 Concat(tWNgram).
10 ClassifierBinaryLinear(cParams);
11
12 return fPrgrm.Plan();
We have instrumented the ML2 library to collect statis-
tics from training and with the related bindings to the
Object Store and Flour to automatically extract Flour
programs from pipelines once trained.
4.1.2 Oven
With Oven, our goal is to bring query compilation and
optimization techniques into ML2.
Optimizer: When plan is called on a Flour transforma-
tion’s reference (e.g., fProgram in line 12 of Figure 1),
all transformations leading to it are wrapped and analyzed.
Oven follows the typical rule-based database optimizer de-
sign where operator graphs (query plans) are transformed
by a set of rules until a fixed point is reached (i.e., the
graph does not change after the application of any rule).
The goal of Oven Optimizer is to transform an input graph
of Flour transformations into a stage graph, where each
stage contains one or more transformations. To group
transformations into stages we used the Tupleware’s hy-
brid approach [26]: memory-intensive transformations
6
PRETZEL: Opening the Black Box of Machine Learning
Prediction Serving Systems
Paper # 355
Abstract
Machine Learning models are often composed of
pipelines of transformations. While this design allows to
efficiently execute single model components at training-
time, prediction serving has different requirements such
as low latency, high throughput and graceful performance
degradation under heavy load. Current prediction serv-
ing systems consider models as black boxes, whereby
prediction-time-specific optimizations are ignored in fa-
vor of ease of deployment. In this paper, we present
PRETZEL, a prediction serving system introducing a
novel white box architecture enabling both end-to-end
and multi-model optimizations. Using production-like
model pipelines, our experiments show that PRETZEL is
able to introduce performance improvements over differ-
ent dimensions; compared to state-of-the-art approaches
PRETZEL is able to reduce latency by up to 100⇥ while
reducing memory footprint by up to 7⇥, and increasing
throughput by up to 2.5⇥.
1 Introduction
Many Machine Learning (ML) frameworks such as
Google TensorFlow [2], Facebook Caffe2 [4], Scikit-
learn [41], or CompanyX’s Internal ML Library (ML2)
allow data scientists to declaratively author pipelines of
transformations to train models from large-scale input
datasets. Model pipelines are internally represented as
Directed Acyclic Graphs (DAGs) of operators comprising
data transformations and featurizers (e.g., string tokeniza-
tion, hashing, etc.), and ML models (e.g., decision trees,
linear models, SVMs, etc.). Figure 1 shows an example
pipeline for text analysis whereby input sentences are
classified according to the expressed sentiment.
Machine Learning is usually conceptualized as a two-
steps process: first, during training model parameters are
estimated from large datasets by running computation-
ally intensive iterative algorithms; successively, trained
pipelines are used for inference to generate predictions
through the estimated model parameters. When trained
“This is a nice product”
Positive vs. Negative
Tokenizer
Char
Ngram
Word
Ngram
Concat
Linear
Regression
Figure 1: A sentimental analysis pipeline consisting of
operators for featurization (ellipses), followed by a ML
model (diamond). Tokenizer extracts tokens (e.g., words)
from the input string. Char and Word Ngrams featurize
input tokens by extracting n-grams. Concat generates a
unique feature vector which is then scored by a Linear
Regression predictor. This is a simplification: the actual
DAG contains about 12 operators.
pipelines are served for inference, the full set of operators
is deployed altogether.However, pipelines have different
system characteristics based on the phase in which they
are employed: for instance, at training time ML models
run complex algorithms to scale over large datasets (e.g.,
linear models can use gradient descent in one of its many
flavors [43, 42, 44]), while, once trained, they behave as
other regular featurizers and data transformations; further-
more, during inference pipelines are often surfaced for
direct users’ servicing and therefore require low latency,
high throughput, and graceful degradation of performance
in case of load spikes.
Existing prediction serving systems, such as Clip-
per [7, 25], TensorFlow Serving [3, 39], Rafiki [49], ML2
itself, and others [13, 14, 36, 12] focus mainly on ease
of deployment, where pipelines are considered as black
boxes and deployed into containers (e.g., Docker [9] in
Clipper and Rafiki, servables in TensorFlow Serving).
Under this strategy, only “pipeline-agnostic” optimiza-
tions such as caching, batching and buffering are avail-
able. Nevertheless, we found that black box approaches
fell short on several aspects. For instance, prediction ser-
1
var fContext = ...;
var Tokenizer = ...;
return fPrgm.Plan();
(1) Flour
Transforms
Logical Stages
S1 S2 S3 1: [x]
2: [y,z]
3: …
int[100]
float[200]
…
Params Stats
Physical Stages
S1 S2 S3
(3) Compilation
(2) Optimization
Model
Stats
Params
Logical
Stages
Physical
Stages
Model Plan
Figure 6: Model optimization and compilation in PRET-
ZEL. In (1), a model is translated into a Flour program. (2)
Oven Optimizer generates a DAG of logical stages from
the program. Additionally, parameters and statistics are
extracted. (3) A DAG of physical stages is generated by
the Oven Compiler using logical stages, parameters, and
statistics. A model plan is the union of all the elements
and is fed to the runtime.
(such as most featurizers) are pipelined together in a sin-
gle pass over the data. This strategy achieves best data
locality because records are likely to reside in CPU reg-
isters [33, 38]. Compute-intensive transformations (e.g.,
vectors or matrices multiplications) are executed one-at-
a-time so that Single Instruction, Multiple Data (SIMD)
vectorization can be exploited, therefore optimizing the
number of instructions per record [56, 20]. Stages are gen-
erated by traversing the Flour transformation graph until a
pipeline-breaking transformation is found: all transforma-
tions up to that point are then grouped into a single stage.
Within each stage, compute-bound operations are vec-
torized. This strategy allows PRETZEL to considerably
decrease the number of vectors (and as a consequence
memory) with respect to the operator-at-a-time strategy
of ML2. Dependencies between stages are created as ag-
gregation of the dependencies between the internal trans-
recognize that the Linear Regression can be pushed into
Char and WordNgram, therefore bypassing the execution
of Concat. Additionally, Tokenizer can be reused between
Char and WordNgram, therefore it will be pipelined with
CharNgram (in one stage) and a dependency between
CharNgram and WordNGram (in another stage) will be
created. The final plan will therefore be composed by 2
stages, versus the initial 4 operators (and vectors) of ML2.
Model Plan Compiler: Model plans are composed by
two DAGs: a DAG composed of logical stages, and a
DAG of physical stages. Logical stages are an abstrac-
tion of the stages output of the Oven Optimizer, with re-
lated parameters; physical stages contains the actual code
that will be executed by the PRETZEL runtime. For.each
given DAG, there is a 1-to-n mapping between logical to
physical stages so that a logical stage can represent the
execution code of different physical implementations. A
physical implementation is selected based on the parame-
ters characterizing a logical stage and available statistics.
Plan compilation is a two step process. After the stage
DAG is generated by the Oven Optimizer, the Model
Plan Compiler (MPC) maps each stage into its logical
representation containing all the parameters for the trans-
formations composing the original stage generated by the
optimizer. Parameters are saved for reuse in the Object
Store (Section 4.1.3). Once the logical plan is generated,
MPC traverses the DAG in topological order and maps
each logical stage into a physical implementation. Phys-
ical implementations are AOT-compiled, parameterized,
lock-free computation units. Each physical stage can be
seen as a parametric function which will be dynamically
fed at runtime with the proper data vectors and pipeline-
specific parameters. This design allows PRETZEL runtime
to share the same physical implementation between mul-
tiple pipelines and no memory allocation occurs on the
prediction-path (more details in Section 4.2.1). Logical
plans maintain the mapping between the pipeline-specific
parameters saved in the Object Store and the physical
stages executing on the runtime. Figure 6 summarized the
process of generating model plans out of ML2 pipelines.
var fContext = ...;
(1) Flour
Transforms
Logical Stages
S1 S2 S3 1: [x]
2: [y,z]
3: …
int[100]
float[200]
…
Params Stats
Physical Stages
S1 S2 S3
(3) Compilation
(2) Optimization
Model
Stats
Params
Logical
Stages
Physical
Stages
Model Plan
formations. Transformation classes are annotated (e.g.,
1-to-1, 1-to-n, memory-bound, compute-bound, commu-
tative and associative) to ease the optimization process:
no dynamic compilation [26] is necessary since the set
of operators is fixed and manual annotation is sufficient
to generate properly optimized plans 4. In the example
sentiment analysis pipeline of Figure 1, Oven is able to
4Note that ML2 does provide a second order operator accepting
arbitrary code requiring dynamic compilation. However, this is not
supported in our current version of PRETZEL.
4.1.3 Object Store
The motivation behind Object Store is based on the in-
sights of Figure 3: since many DAGs have similar struc-
tures, sharing operators’ state (parameters) can consid-
erably improve memory footprint, and consequently the
number of predictions served per machine. An example
is language dictionaries used for input text featurization,
which are often in common among many models and are
relatively large. The Object Store is populated off-line by
var fContext = ...;
(1) Flour
Transforms
Logical Stages
S1 S2 S3 1: [x]
2: [y,z]
3: …
int[100]
float[200]
…
Params Stats
Physical Stages
S1 S2 S3
(3) Compilation
(2) Optimization
Model
Stats
Params
Logical
Stages
Physical
Stages
Model Plan
4.1.3 Object Store
var fContext = ...;
(1) Flour
Transforms
Logical Stages
S1 S2 S3 1: [x]
2: [y,z]
3: …
int[100]
float[200]
…
Params Stats
Physical Stages
S1 S2 S3
(3) Compilation
(2) Optimization
Model
Stats
Params
Logical
Stages
Physical
Stages
Model Plan
Char and WordNgram, therefore bypassing the e
of Concat. Additionally, Tokenizer can be reused
Char and WordNgram, therefore it will be pipelin
CharNgram (in one stage) and a dependency
CharNgram and WordNGram (in another stage
created. The final plan will therefore be compo
stages, versus the initial 4 operators (and vectors)
Model Plan Compiler: Model plans are comp
two DAGs: a DAG composed of logical stage
DAG of physical stages. Logical stages are an
tion of the stages output of the Oven Optimizer,
lated parameters; physical stages contains the act
that will be executed by the PRETZEL runtime.
given DAG, there is a 1-to-n mapping between lo
physical stages so that a logical stage can repre
execution code of different physical implementa
physical implementation is selected based on the
ters characterizing a logical stage and available s
Plan compilation is a two step process. After t
DAG is generated by the Oven Optimizer, the
Plan Compiler (MPC) maps each stage into its
representation containing all the parameters for t
formations composing the original stage generate
optimizer. Parameters are saved for reuse in the
Store (Section 4.1.3). Once the logical plan is ge
MPC traverses the DAG in topological order an
each logical stage into a physical implementatio
ical implementations are AOT-compiled, param
lock-free computation units. Each physical stag
seen as a parametric function which will be dyn
fed at runtime with the proper data vectors and p
specific parameters. This design allows PRETZEL
to share the same physical implementation betw
tiple pipelines and no memory allocation occur
prediction-path (more details in Section 4.2.1).
plans maintain the mapping between the pipeline
parameters saved in the Object Store and the
stages executing on the runtime. Figure 6 summa
process of generating model plans out of ML2 p
4.1.3 Object Store
The motivation behind Object Store is based o
sights of Figure 3: since many DAGs have simil
tures, sharing operators’ state (parameters) can
erably improve memory footprint, and conseque
number of predictions served per machine. An e
is language dictionaries used for input text featu
which are often in common among many models
relatively large. The Object Store is populated of
7
Oven
Optimiser
var fContext = ...;
(1) Flour
Transforms
Logical Stages
S1 S2 S3 1: [x]
2: [y,z]
3: …
int[100]
float[200]
…
Params Stats
Physical Stages
S1 S2 S3
(3) Compilation
(2) Optimization
Model
Stats
Params
Logical
Stages
Physical
Stages
Model Plan
4.1.3 Object Store
7
var fContext = ...;
(1) Flour
Transforms
Logical Stages
S1 S2 S3 1: [x]
2: [y,z]
3: …
int[100]
float[200]
…
Params Stats
Physical Stages
S1 S2 S3
(3) Compilation
(2) Optimization
Model
Stats
Params
Logical
Stages
Physical
Stages
Model Plan
4.1.3 Object Store
Flour API

• Two main components:
– Runtime, with an Object Store
– Scheduler
• Runtime handles physical resources: threads and buffers
• Object Store caches objects of all models
– models register and retrieve state objects via a key
• Scheduler is event-based, each stage being an event
13
On-line phase

• Two model classes written in ML.NET, running in ML.NET and
Pretzel
– 250 Sentiment-Analysis models
– 50 Regression Task models
• Testbed representing a small production server
– 2 8-core Xeon E5-2620 v4 at 2.10 GHz, HT enabled
– 32 GB RAM
– Windows 10
15
Workload and testbed

• 7x less memory for SA models, 40% for RT models
• Higher model density, higher efficiency and
profitability
16
Memory
Figure 8: Cumulative memory usage of the model
pipelines with and without Object Store, normalized by
the maximum usage with Object Store for SA models.
By sharing common objects, we can decrease memory
usage by up to 7 ⇥ in SA and around 40% for RT models.
Keeping track of plan parameters can also help reduce the
time for loading models: as expected PRETZEL can load
300 models 7.4 times faster compared to ML2.
5.2 Latency
In this experiment we study the latency behavior of PRET-
ZEL and compare it with ML2. We also study hot and cold
scenarios. Specifically, we label as cold the first predic-
tion requested for a model; the successive 10 predictions
are then discarded and we report hot numbers as the aver-
age of the following 100 predictions. Inference requests
are submitted sequentially and in isolation for one model
at a time. For PRETZEL we use the request-response en-
gine. The comparison between PRETZEL and ML2 is
reported in Figure 9. As we can see, in ML2 the 99th per-
centile latency in cold (denoted by P99cold) is more than
an order-of-magnitude slower than P99hot, with a peak of
almost a 300⇥ slowdown in the worst case. Conversely,
in PRETZEL P99cold is within 3⇥ from the P99hot and
Figure 9: Latency comparison between ML2 and PRET-
ZEL, with values normalized over ML2’s P99hot latency.
Cliffs indicate pipelines with big differences in latency.
Figure 10: Latency of PRETZEL to run SA models with
and without sub-plan materialization, normalized by the
P99hot latency of the value obtained without the optimiza-
tion. Around 85% of SA pipelines show 2.5⇥ speedup.
Sub-plan materialization does not apply for RT pipelines.
sult caching enabled, latency decreases even further when
same inferences are requested for the same model. In this
case the latency is proportional to a dictionary look up.
5.3 Throughput
In this experiment, we assume a batch scenario where all
300 models are scored several times using a fixed batch

• Hot vs Cold scenarios
• Without and with Caching of partial results
17
Latency
8: Cumulative memory usage of the model
es with and without Object Store, normalized by
ximum usage with Object Store for SA models.
ring common objects, we can decrease memory
y up to 7 ⇥ in SA and around 40% for RT models.
g track of plan parameters can also help reduce the
r loading models: as expected PRETZEL can load
dels 7.4 times faster compared to ML2.
Latency
xperiment we study the latency behavior of PRET-
d compare it with ML2. We also study hot and cold
os. Specifically, we label as cold the first predic-
uested for a model; the successive 10 predictions
discarded and we report hot numbers as the aver-
he following 100 predictions. Inference requests
mitted sequentially and in isolation for one model
e. For PRETZEL we use the request-response en-
The comparison between PRETZEL and ML2 is
d in Figure 9. As we can see, in ML2 the 99th per-
atency in cold (denoted by P99cold) is more than
r-of-magnitude slower than P99hot, with a peak of
a 300⇥ slowdown in the worst case. Conversely,
TZEL P99cold is within 3⇥ from the P99hot and
5.3 Throughput
Figure 8: Cumulative memory usage of the model
pipelines with and without Object Store, normalized by
the maximum usage with Object Store for SA models.
By sharing common objects, we can decrease memory
usage by up to 7 ⇥ in SA and around 40% for RT models.
Keeping track of plan parameters can also help reduce the
time for loading models: as expected PRETZEL can load
300 models 7.4 times faster compared to ML2.
5.2 Latency
In this experiment we study the latency behavior of PRET-
ZEL and compare it with ML2. We also study hot and cold
scenarios. Specifically, we label as cold the first predic-
tion requested for a model; the successive 10 predictions
are then discarded and we report hot numbers as the aver-
age of the following 100 predictions. Inference requests
are submitted sequentially and in isolation for one model
at a time. For PRETZEL we use the request-response en-
gine. The comparison between PRETZEL and ML2 is
reported in Figure 9. As we can see, in ML2 the 99th per-
centile latency in cold (denoted by P99cold) is more than
an order-of-magnitude slower than P99hot, with a peak of
almost a 300⇥ slowdown in the worst case. Conversely,
in PRETZEL P99cold is within 3⇥ from the P99hot and
20⇥ in the worst case (an improvement of more than 10⇥
on tail latency). If instead we compare directly PRET-
ZEL with ML2, PRETZEL is 2.6⇥ faster then ML2 in the
P99hot case, and about 15⇥ in the P99cold case. Peak per-
formance gains for PRETZEL for SA pipelines is above
50⇥, 100⇥ in the RT category.
Sub-plan Materialization: Figure 10 depicts the effect
of the sub-plan materialization in prediction latency. As
we can notice, only part of the pipelines benefit from
materializing partial results, while no pipeline shows per-
formance deterioration (except the first cold scoring of
the first model, which shows a 6⇥ slowdown). In general,
5.3 Throughput
size of 1000. We used from 2 up to 29 cores to serve
requests, while 3 cores are reserved to generate them. The
main goal is to measure the maximum number of requests
PRETZEL and ML2 can serve per second. Materialization
and caching are not enabled for this experiment.
Figure 11 shows that PRETZEL’s throughput is up to
2.5⇥ higher than ML2 (at 16 cores scale) and scales on
par with the expected maximum scaling 5 whereas ML2
suffers from decrease in scalability when the number of
CPU cores increases. This is because each thread has
its own internal copy of models whereby cache lines are
5
W/o Caching W/o and w/ Caching
• ML.NET cold: 10-300x w.r.t. hot
• Pretzel cold: 3-20x slower than hot
• Pretzel vs ML:NET: cold is 15x faster,
hot is 2.6x
• For single models runs, average
improvement is 2.5x
• More gain with multiple requests

• Batch size is 1000 inputs, multiple runs of same batch
• Delay batching: as in Clipper, batch requests and accept
higher latency
18
Throughput
Figure 11: The average throughput computed among the
300 models to process one million inputs each. We scale
the number of cores on the x-axis. PRETZEL scales lin-
early to the number of CPU cores, close to the expected
maximum throughput with Hyper Threading enabled.
not shared, thus increasing the pressure on the memory
subsystem: indeed, even if the data values are the same,
the model objects are mapped to different memory areas.
Delay Batching: As in Clipper, PRETZEL FrontEnd al-
lows users to specify a maximum delay they can wait to
maximize throughput. For each model category, Figure 12
depicts the trade-off between throughput and latency as
the delay increases. Interestingly, for such models even a
small delay helps reaching the optimal batch size.
Figure 12: Throughput and latency of SA and RT models
latency, instead, gracefully increases linearly with the
increase of the load.
Figure 13: Throughput and latency of PRETZEL under
heavy load scenario.
Reservation Scheduling: If we run the previous exper-
iment using reservation scheduling for one model, such
model does not encounter any degradation in latency (max
improvement of 3 orders-of-magnitude) as the load in-
creases, while the total throughput increases about 3%.
In the experiment, 28 cores are used for the shared load,
while one core (and related vectors) is reserved exclu-
sively for one model.
6 Limitations and Future Work
Off-line Phase: PRETZEL suffers two limitations regard-
ing Flour and Oven design. First, PRETZEL currently has
several logical and physical stages classes, one per possi-
ble implementation, which make the system difficult to
maintain in the long run. Additionally, different back-ends
(e.g., PRETZEL currently supports operators implemented
in C# and C++, and experimentally on FPGA [X]) require
all specific operator implementations. We are however
confident that this limitation will be overcome once code
with delayed batching: while throughput increases with
the delay, also record latency increase. RT latency is more
tolerant to larger delays compared to SA.
5.4 Heavy Load
In this last experiment we load all 300 models in one in-
stance and we submit requests to all models concurrently
following a Zipf distribution. As for the throughput exper-
iment we use 3 threads to generate load and 29 to serve
requests. Figure 13 reports how throughput and latency
vary as we increase the load. As expected, PRETZEL’s
throughput increases initially and stabilizes. PRETZEL’s
Figure 13: Throughput and latency of PRET
Reservation Scheduling: If we run the prev
iment using reservation scheduling for one m
model does not encounter any degradation in la
improvement of 3 orders-of-magnitude) as t
creases, while the total throughput increases
In the experiment, 28 cores are used for the s
while one core (and related vectors) is reser
Off-line Phase: PRETZEL suffers two limitati
ing Flour and Oven design. First, PRETZEL cu
several logical and physical stages classes, on
ble implementation, which make the system
maintain in the long run. Additionally, differen
(e.g., PRETZEL currently supports operators im
in C# and C++, and experimentally on FPGA [
all specific operator implementations. We ar
confident that this limitation will be overcome
generation of stages will be added (e.g., with
specific templates [34]). Secondly, Flour and
currently limited to pipelines authored in ML
ing models from different frameworks to the
approach may require non-trivial work. On th
our goal is, however, to target unified model fo
as ONNX [5]; this will allow us to apply the
techniques to models from other ML framewo
On-line Phase: PRETZEL’s fine-grained, s
scheduling may introduce additional overhea
spect to coarse-grained whole pipeline schedu
additional buffering and context switching.
heads are, however, related to the system load
fore controllable by the scheduler. On white bo
tures, failures happening during the execution
11
W/o Delay batching
• ML.NET suffers from missed data
sharing, i.e. higher memory traffic
W/ Delay batching
• Not supported in ML.NET
• Quickly reaches optimal batch size

CONCLUSIONS AND FUTURE WORK
19

• We addressed performance/density bottlenecks in ML
inference for MaaS
• We advocate the adoption of a white-box approach
• We are re-thinking ML inference as a DB query problem
- buffer/locality issues
- code generation
- runtime, hardware-aware scheduling
20
Conclusions

• Oven has pre-written implementations for logical and
physical stages: not maintainable in the long run
• We are adding code-generation of stages, maybe with
hardware-specific templates [10]
• Tune code-generation for more-aggressive optimizations
• Support more model formats than ML.NET, like ONNX [11]
21
Off-line phase
[10] K.Krikellas, S.Viglas,et al. Generating code for holistic query evaluation, in ICDE, pages 613–624. IEEE Computer Society, 2010
[11] Open Neural Network Exchange (ONNX). https://onnx.ai, 2017

• Distributed version
• Main work is on NUMA awareness, to improve locality and
scheduling
– Schedule on the NUMA node where most shared state
resides
– Duplicate hot, shared state objects on nodes for lower
latency
– Scale better, avoiding bottlenecks on coherence bus
22
On-line phase
Pretzel is currently under submission at OSDI ‘18
QUESTIONS ?
• Distributed version
• Main work is on NUMA awareness, to improve locality and
scheduling
– Schedule on the NUMA node where most shared state
resides
– Duplicate hot, shared state objects on nodes for lower
latency
– Scale better, avoiding bottlenecks on coherence bus
Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it

• Oven’s Model Plan Compiler compiles logical stages into
physical stages, based on parameters and statistics
24
Off-line phase - Oven
[9] A.Crotty,A.Galakatos,K.Dursun, et al. An architecture for compiling UDF-centric workﬂows. PVLDB, 8(12):1466–1477, Aug. 2015
var fContext = ...;
(1) Flour
Transforms
Logical Stages
S1 S2 S3 1: [x]
2: [y,z]
3: …
int[100]
ﬂoat[200]
…
Params Stats
Physical Stages
S1 S2 S3
(3) Compilation
(2) Optimization
Model
Stats
Params
Logical
Stages
Physical
Stages
Model Plan
4.1.3 Object Store
7
• Oven optimises pipelines much like
queries
• Tupleware’s hybrid approach [9]
– memory-intensive operators are
merged into a logical stage, to
increase data locality
– Compute-intensive operators stay
alone, to leverage SIMD execution
– One-to-one operators are chained
together, while one-to-many can
break locality

• Load all 300 models upfront
• 3 threads to generate requests, 29 to score
• One core serves single requests, the others serve batches
• ML.NET does not support this scenarios
• Latency degrades gracefully with load
25
Heavy load
Figure 11: The average throughput computed among the
latency, instead, gracefully increases linearly with the
increase of the load.
Figure 13: Throughput and latency of PRETZEL under
Reservation Scheduling: If we run the previous exper-
iment using reservation scheduling for one model, such
model does not encounter any degradation in latency (max
improvement of 3 orders-of-magnitude) as the load in-
creases, while the total throughput increases about 3%.
In the experiment, 28 cores are used for the shared load,
while one core (and related vectors) is reserved exclu-
Off-line Phase: PRETZEL suffers two limitations regard-
ing Flour and Oven design. First, PRETZEL currently has
several logical and physical stages classes, one per possi-
ble implementation, which make the system difficult to
maintain in the long run. Additionally, different back-ends
(e.g., PRETZEL currently supports operators implemented
in C# and C++, and experimentally on FPGA [X]) require
all specific operator implementations. We are however
confident that this limitation will be overcome once code
generation of stages will be added (e.g., with hardware-

Pretzel: optimized Machine Learning framework for low-latency and high throughput workloads

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Pretzel: optimized Machine Learning framework for low-latency and high throughput workloads

Ähnlich wie Pretzel: optimized Machine Learning framework for low-latency and high throughput workloads (20)

Mehr von NECST Lab @ Politecnico di Milano

Mehr von NECST Lab @ Politecnico di Milano (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Pretzel: optimized Machine Learning framework for low-latency and high throughput workloads