The ever-increasing interest in deep learning and neural networks has led to a proliferation of processing frameworks like TensorFlow and PyTorch. These libraries are built around the idea of a computational graph that models the dataflow of individual units. Because tensors are their basic computational unit, these frameworks can run efficiently on hardware accelerators (e.g., GPUs). Traditional machine learning (ML) models such as linear regressions and decision trees in scikit-learn cannot currently run on GPUs, missing out on the acceleration that deep learning and neural networks enjoy.
In this talk, we’ll show how you can use Hummingbird to achieve up to 1000x speedups in GPU inference by converting your traditional ML models to tensor-based models (PyTorch and TVM). https://github.com/microsoft/hummingbird
This talk is for intermediate audiences that use traditional machine learning and want to speed up inference with these models. After watching the talk, the audience should be able to use ~5 lines of code to convert their traditional models to tensor-based models and try them out on GPUs.
Outline:
Introduction of what ML inference is (and why it’s different from training)
Motivation: Tensor-based DNN frameworks allow inference on GPU, but “traditional” ML frameworks do not
Why “traditional” ML methods are important
Introduction of what Hummingbird does and main benefits
Deep dive on how traditional ML models are built
Brief intro on how the Hummingbird converter works
Example of how Hummingbird can convert a tree model into a tensor-based model
Other models
Demo
Status
Q&A
5. Machine Learning Prediction Serving
1. Learn: models are learned from data (Data → Training → Model)
2. Deploy: models are deployed and served together (Model → Server ← Users: prediction serving)
8. Model Serving
Specialized systems have been developed
Focus: Deep Learning (DL)
Support for traditional ML methods is largely overlooked
9. Traditional ML Models
2019 Kaggle Survey: The State of Data Science & Machine Learning
Data Science through the Looking Glass: https://arxiv.org/abs/1912.09536
10. Problem: Lack of Optimizations for Traditional ML Serving
Systems for training traditional ML models are not optimized for serving
Traditional ML models are expressed using imperative code in an ad-hoc fashion, not using a shared logical abstraction
Traditional ML models cannot natively exploit hardware acceleration
11. How do “Traditional ML Models” look inside?
<Example: Binary Classification>
Traditional ML Model: Tokenizer → {Char Ngram, Word Ngram} → Concat → Logistic Regression → 0 vs 1
Input row: A=a, B=0.1, C=c, D=0.5
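The text pipeline on this slide can be sketched in scikit-learn; the feature names and toy data below are illustrative, not from the talk. `FeatureUnion` plays the role of the Concat operator, merging the char-ngram and word-ngram feature blocks:

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("features", FeatureUnion([               # Concat: merges the two feature blocks
        ("char_ngram", CountVectorizer(analyzer="char", ngram_range=(2, 3))),
        ("word_ngram", CountVectorizer(analyzer="word", ngram_range=(1, 2))),
    ])),
    ("clf", LogisticRegression()),            # binary predictor: 0 vs 1
])

texts = ["good movie", "bad movie", "great film", "awful film"]
labels = [1, 0, 1, 0]
pipeline.fit(texts, labels)
preds = pipeline.predict(["good film"])
```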
12. How do “Traditional ML Models” look inside?
<Example: Binary Classification>
DAG of Operators (aka pipeline): Split → {Scaler, OneHot} → Concat → Logistic Regression → 0 vs 1
Input row: A=a, B=0.1, C=c, D=0.5
13-14. How do “Traditional ML Models” look inside?
<Example: Binary Classification>
DAG of Operators (aka pipeline): Split → {Scaler, OneHot} → Concat → Logistic Regression → 0 vs 1
Featurizers: Split, Scaler, OneHot, Concat
Predictor: Logistic Regression
Input row: A=a, B=0.1, C=c, D=0.5
15-19. How do “Traditional ML Models” look inside?
<Example: Binary Classification>
DAG of Operators (aka pipeline): Split → {Scaler, OneHot} → Concat → Logistic Regression → 0 vs 1
Split: split the input into categorical (cat) and numerical (num) columns
Scaler: normalize num
OneHot: one-hot encode cat
Concat: merge the two vectors
Logistic Regression: compute the final score
Input row: A=a, B=0.1, C=c, D=0.5
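The tabular pipeline above maps directly onto scikit-learn building blocks (the column names and toy data here are illustrative): `ColumnTransformer` acts as the Split operator and implicitly Concats the transformed blocks before the predictor.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer          # Split: routes columns to sub-transformers
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({"A": ["a", "b", "a", "c"],
                   "B": [0.1, 0.2, 0.3, 0.4],
                   "C": ["c", "d", "c", "d"],
                   "D": [0.5, 0.6, 0.7, 0.8]})
y = [0, 1, 0, 1]

pre = ColumnTransformer([
    ("num", StandardScaler(), ["B", "D"]),             # normalize numerical columns
    ("cat", OneHotEncoder(), ["A", "C"]),              # one-hot encode categorical columns
])                                                     # outputs are concatenated (Concat)
model = Pipeline([("featurize", pre),
                  ("predict", LogisticRegression())])  # Predictor: 0 vs 1
model.fit(df, y)
preds = model.predict(df)
```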
21. Deep Learning
Primarily relies on the abstraction of tensors
DL models are expressed as a DAG of tensor operators:
User input X → MatMul(w1) → Add(b1) → ReLU → MatMul(w2) → Add(b2) → Sigmoid
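The DAG above can be written out directly as tensor operations. A minimal sketch in numpy (shapes are illustrative; in PyTorch the same chain would use `torch.matmul`, `torch.relu`, and `torch.sigmoid`):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))                  # user input: batch of 8, 4 features
w1, b1 = rng.standard_normal((4, 16)), rng.standard_normal(16)
w2, b2 = rng.standard_normal((16, 1)), rng.standard_normal(1)

h = relu(X @ w1 + b1)                            # MatMul -> Add -> ReLU
out = sigmoid(h @ w2 + b2)                       # MatMul -> Add -> Sigmoid
```

Every step is a uniform tensor operation, which is exactly what lets DL runtimes map the whole DAG onto a GPU.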
22. Systems for DL Prediction Serving
Exploit the abstraction of tensor operations (MatMul, Add, ReLU, …) to support multiple DL frameworks on multiple target environments
Benefits:
✔ Efficient implementations
✔ Declarative
✔ Seamless hardware acceleration
✔ Reduced engineering efforts
24. Converting ML Operators into Tensor Operations
Observation: pipelines are composed of two classes of operators
Algebraic operations, e.g., linear regression: Y = wX + b
Algorithmic operations, e.g., RandomForest, OneHotEncoder: complex data access patterns and control-flow patterns!
Our solution: make data access patterns and control flow uniform for all inputs
This introduces redundancies, both computational and storage
Depending on the level of redundancy introduced, there can be more than one potential compilation approach
Hummingbird picks the one that works best given pipeline statistics
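To make the "uniform data access" idea concrete, here is a sketch of how an algorithmic operator like one-hot encoding can be recast as pure tensor operations (the categories and inputs are made up for the example): instead of looking each value up in a dictionary, compare every input against every category at once, so every input executes the same branch-free computation.

```python
import numpy as np

categories = np.array([1, 5, 9])     # learned categories (illustrative)
x = np.array([5, 9, 1, 5])           # integer-encoded input column

# Broadcasted comparison: same ops for every input, no per-row control flow.
onehot = (x[:, None] == categories[None, :]).astype(np.float32)
print(onehot)
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]
```

The redundancy (comparing against every category, not just the matching one) is the price paid for a computation that maps cleanly onto tensor runtimes and GPUs.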
34. Compiling Decision Tree-based Models
The above approach (the GEMM approach) essentially evaluates all paths in a decision tree model: computational redundancy.
It works surprisingly well on modern hardware in many cases!
There are two other tree-traversal-based methods that exploit the tree structure:
one for tall trees (e.g., LightGBM) and one for bushy trees (e.g., XGBoost)
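A minimal sketch of the GEMM idea on a hand-built depth-2 tree (the tree, matrices, and inputs are made up for illustration): all node conditions are evaluated at once with a matrix multiply, and a second multiply checks which leaf's path is fully satisfied.

```python
import numpy as np

# Example tree:
#   node0: x[0] < 0.5 ? -> leaf L0 : node1
#   node1: x[1] < 0.3 ? -> leaf L1 : leaf L2
A = np.array([[1., 0.],     # feature 0 is tested by internal node 0
              [0., 1.]])    # feature 1 is tested by internal node 1
b = np.array([0.5, 0.3])    # thresholds for the two internal nodes

# C[i, j]: +1 if leaf j's path needs node i's condition true,
#          -1 if it needs it false, 0 if node i is off the path.
C = np.array([[ 1., -1., -1.],
              [ 0.,  1., -1.]])
d = np.array([1., 1., 0.])  # count of "must be true" conditions per leaf

X = np.array([[0.2, 0.9],   # -> L0
              [0.8, 0.1],   # -> L1
              [0.8, 0.9]])  # -> L2

I = (X @ A < b).astype(np.float64)    # evaluate ALL node conditions at once
leaf = np.argmax(I @ C == d, axis=1)  # the leaf whose path is fully satisfied
print(leaf)  # [0 1 2]
```

Every input pays for every node in the tree, but the computation is two dense matrix products, which GPUs execute extremely well.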
36. Tree Traversal Method
Initial node index: 0; repeat while depth < max tree depth:
1. Gather the feature ids for the current nodes
2. Gather the corresponding feature values from the input X
3. Gather the thresholds for the current nodes
4. Compare: cond = (feature value < threshold)
5. Where(cond, Lefts, Rights): step to the left child if true, the right child if false
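The gather/where loop above can be sketched in numpy on a hand-built tree (the array layout is illustrative): the tree is stored as flat arrays indexed by node id, and leaves point to themselves so extra iterations are no-ops.

```python
import numpy as np

# Example tree: node0 (root, x[0] < 0.5), node1 = leaf L0,
# node2 (x[1] < 0.3), node3 = leaf L1, node4 = leaf L2.
feature   = np.array([0, 0, 1, 0, 0])
threshold = np.array([0.5, np.inf, 0.3, np.inf, np.inf])  # inf: leaves always self-loop
lefts     = np.array([1, 1, 3, 3, 4])
rights    = np.array([2, 1, 4, 3, 4])
values    = np.array([-1, 0, -1, 1, 2])   # class stored at each leaf (-1 = internal node)

X = np.array([[0.2, 0.9], [0.8, 0.1], [0.8, 0.9]])
idx = np.zeros(len(X), dtype=int)          # every row starts at the root
for _ in range(2):                         # max tree depth
    f = feature[idx]                       # Gather feature ids
    v = X[np.arange(len(X)), f]            # Gather feature values
    t = threshold[idx]                     # Gather thresholds
    idx = np.where(v < t, lefts[idx], rights[idx])  # Where(cond, Lefts, Rights)
preds = values[idx]
print(preds)  # [0 1 2]
```

The loop runs a fixed number of iterations for every input, so the control flow is uniform even though each row follows a different path through the tree.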
44. Hummingbird Updates
• Hummingbird has reached > 21K PyPI downloads and 2.4K GitHub stars
• Demoed at Microsoft Ignite
• Integrated with the ONNX converter tools
• OSDI paper
• New features include:
  • Pandas DataFrames
  • PySpark ML support
  • TVM support
• Looking for new users/contributors!