Slides by Alexander März:
The language of statistics is probabilistic in nature. Any model that falls short of quantifying the uncertainty attached to its outcome is likely to provide an incomplete and potentially misleading picture. While this is a firmly established consensus in statistics, machine learning approaches usually lack proper ways of quantifying uncertainty. In fact, a possible distinction between the two modelling cultures can be attributed to the (non-)existence of uncertainty estimates that allow for, e.g., hypothesis testing or the construction of estimation/prediction intervals. Quantification of uncertainty in general, and probabilistic forecasting in particular, does not just provide an average point forecast; rather, it equips the user with a range of outcomes and the probability of each of them occurring.
In an effort to bring both disciplines closer together, the audience is introduced to a new framework for XGBoost that predicts the entire conditional distribution of a univariate response variable. In particular, XGBoostLSS models all moments of a parametric distribution (i.e., mean, location, scale and shape [LSS]) instead of the conditional mean only. Choosing from a wide range of continuous, discrete and mixed discrete-continuous distributions, modelling and predicting the entire conditional distribution greatly enhances the flexibility of XGBoost, as it allows us to gain additional insight into the data-generating process and to create probabilistic forecasts from which prediction intervals and quantiles of interest can be derived. As such, XGBoostLSS contributes to the growing literature on statistical machine learning that aims at weakening the separation between Breiman's "Data Modelling Culture" and "Algorithmic Modelling Culture", so that models designed mainly for prediction can also be used to describe and explain the underlying data-generating process of the response of interest.
2. XGBoostLSS
An extension of XGBoost to probabilistic forecasting
Dr. Alexander März
Erlangen Artificial Intelligence & Machine Learning Meetup
November 14, 2019
3. Table of Contents
1. Embracing Uncertainty
2. Distributional Modelling
3. Gradient Boosting Re-visited
4. XGBoostLSS
Estimation
Simulation Example
Real World Example: Modelling Munich Rents
6. What does ML mean for you?
ML Framework
Maximum Likelihood
• Inference
• Quantification of uncertainty (Fisher Information Matrix)
Machine Learning
• Modelling ≈ function optimization
• Focus on prediction accuracy
8. Fundamental Principle 1
To reason rigorously under uncertainty, we need to invoke the language of probability and statistics!
*Zhang, A. et al. (2019). Dive into Deep Learning. http://d2l.ai/index.html
15. Embracing Uncertainty: Probabilistic Forecasting
Traditional point-forecasting approaches are expected to produce correct figures. Naturally, however, the future is uncertain. Instead of taking only one possible future into account, probabilistic forecasts assign probabilities to different outcomes.
• Probabilistic forecasts provide a realistic way of looking at the future:
• instead of hoping that point forecasts materialize, probabilistic forecasts remind you that everything is possible, just not quite equally probable.
*https://www.lokad.com/probabilistic-forecasting
16. Embracing Uncertainty: Probabilistic Forecasting
Probabilistic forecasts are predictions in the form of a probability distribution, rather than a single point estimate.
17. Embracing Uncertainty: Probabilistic Forecasting
Old: What is the average value of an outcome, given the features?
New: What are the probabilities of an outcome, given the features?
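To make the contrast concrete, here is a minimal Python sketch (with made-up numbers, not taken from the slides): a point forecast answers the old question, while a predictive distribution answers the new one.

```python
# Illustrative only: contrast a point forecast with a probabilistic forecast.
# All numbers below are invented for the sake of the example.
from scipy.stats import norm

# Old: a single point forecast, e.g., an expected outcome of 100 units.
point_forecast = 100.0

# New: a full predictive distribution, here a Normal with predicted
# mean 100 and predicted standard deviation 15.
predictive_dist = norm(loc=100.0, scale=15.0)

# The distribution answers probabilistic questions a point forecast cannot:
prob_above_120 = 1 - predictive_dist.cdf(120)     # P(outcome > 120)
interval_90 = predictive_dist.ppf([0.05, 0.95])   # central 90% prediction interval

print(f"Point forecast: {point_forecast}")
print(f"P(Y > 120) = {prob_above_120:.3f}")
print(f"90% prediction interval: [{interval_90[0]:.1f}, {interval_90[1]:.1f}]")
```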
20. Distributional Modelling
The ultimate goal of regression analysis is to obtain information about the [entire] conditional distribution of a response given a set of explanatory variables.
• The focus of machine learning is mainly on modelling $E(Y \mid X = x) = f(x)$, e.g., splitting procedures in CART favour the detection of changes in the mean.
• In general, they have very low power for detecting other patterns (e.g., changes in variance) even if these can be related to covariates.
• Information about the entire conditional distribution $P(Y \le y \mid X = x) = F_Y(y \mid x)$ is not available.
*Hothorn, T. et al. (2014). Conditional transformation models. JRSS: Series B (Statistical Methodology) 76(1), 3–27.
22. Distributional Modelling
Relate all distributional parameters to explanatory variables:
$$y_i \overset{ind}{\sim} \mathcal{D}\Big(h_1(\vartheta_{i1}) = \eta_{i1},\, h_2(\vartheta_{i2}) = \eta_{i2},\, \ldots,\, h_K(\vartheta_{iK}) = \eta_{iK}\Big), \quad i = 1, \ldots, n$$
with a flexible predictor
$$\eta_k = f_k(x), \quad k = 1, \ldots, K$$
where $f_k(\cdot)$ can take on several forms:
• $f_k(x) = X_k \beta_k + \sum_{j=1}^{p_k} f_{k,j}(x_j)$
• $f_k(x)$ = Random Forest, Gradient Boosting Trees, Neural Network, . . .
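A minimal numpy sketch of this idea, assuming a Normal distribution with the two parameters mu and sigma; the predictor functions and coefficients below are hypothetical and only illustrate how link functions map the predictors eta_k to valid parameter values.

```python
# A minimal sketch of distributional modelling, assuming a Normal response
# with K = 2 parameters (mu, sigma). The predictors eta_mu() and eta_sigma()
# are hypothetical placeholders; in practice they could be linear terms,
# splines, boosted trees, a neural network, etc.
import numpy as np

rng = np.random.default_rng(123)
x = rng.uniform(size=1000)

def eta_mu(x):      # predictor for the location parameter
    return 10.0 + 2.0 * x

def eta_sigma(x):   # predictor for the scale parameter
    return -0.5 + 1.5 * x

# Inverse link functions map the predictors to valid parameter values:
mu = eta_mu(x)                 # identity link for mu
sigma = np.exp(eta_sigma(x))   # log link keeps sigma positive

# Every observation gets its own conditional distribution D(mu_i, sigma_i).
y = rng.normal(loc=mu, scale=sigma)
```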
23. Distributional Modelling
• Restrictive assumption of strong stationarity in (time series) modelling:
$$y \overset{iid}{\sim} \mathcal{D}\big(\mu(x), \vartheta\big)$$
$$F_Y(y_{t_1}, \ldots, y_{t_n}) = F_Y(y_{t_1+\tau}, \ldots, y_{t_n+\tau}), \quad \forall\, n, t_1, \ldots, t_n, \tau$$
• As all distributional parameters are functions of covariates, we shift the entire distribution forward in time:
$$y \overset{ind}{\sim} \mathcal{D}\big(\vartheta(x)\big)$$
31. Gradient Descent Boosting
Sketch of Algorithm
1. Set $\hat{f}(x) = 0$ and $r_i = y_i$ for all $i = 1, \ldots, N$ in the training set
2. For $t = 1, 2, \ldots, T$, repeat:
(a) Fit a tree $\hat{f}^{t}$ to the training data $(X, r)$
(b) Update $\hat{f}$ by adding a shrunken version of the new estimate
$$\hat{f}^{(t)}(x) = \hat{f}^{(t-1)}(x) + \eta\, \hat{f}^{t}(x)$$
(c) Update the residuals
$$r_i = r_i - \eta\, \hat{f}^{t}(x_i)$$
3. Output the boosted model
$$\hat{f}(x) = \sum_{t=1}^{T} \eta\, \hat{f}^{t}(x)$$
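The algorithm sketch above translates almost line by line into code. The following is a minimal Python implementation using shallow regression trees from scikit-learn as base learners; the simulated data, tree depth and learning rate are illustrative choices, not taken from the slides.

```python
# Direct translation of the algorithm sketch: boosting shallow regression
# trees on residuals with a shrinkage factor eta.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(size=(500, 1))
y = np.sin(6 * X[:, 0]) + rng.normal(scale=0.3, size=500)

T, eta = 100, 0.1            # number of boosting rounds and learning rate
trees = []
r = y.copy()                 # 1. f_hat(x) = 0, residuals r_i = y_i

for t in range(T):           # 2. for t = 1, ..., T
    tree = DecisionTreeRegressor(max_depth=2).fit(X, r)  # (a) fit tree to (X, r)
    trees.append(tree)
    r -= eta * tree.predict(X)                           # (b)+(c) shrunken update

def f_hat(X_new):
    """3. Boosted model: sum of shrunken trees."""
    return eta * sum(tree.predict(X_new) for tree in trees)

print("Training MSE:", np.mean((y - f_hat(X)) ** 2))
```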
32. Gradient Descent Boosting
• Boosting iteratively fits a tree to the residuals from the previous model.
• To see the connection between residuals and the gradient, we need to go back to Maximum Likelihood.
• In the following, we assume the data are generated as
$$y = f(x) + \epsilon$$
• Since most Machine Learning models focus on the conditional mean only, we have
$$E(y \mid x) = f(x)$$
33. Gradient Descent Boosting
• Most boosting models implicitly assume a Normal distribution as a default loss function, with all higher moments being constant.
• For a Gaussian $N\big(f(x), \sigma^2\big)$ with density
$$f\big(y \mid f(x), \sigma^2\big) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{\big(y_i - f(x_i)\big)^2}{2\sigma^2}\right)$$
the corresponding log-likelihood is
$$\log L\big(f(x), \sigma\big) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2$$
34. Gradient Descent Boosting
• We focus on the kernel of the Normal, where $\sigma^2$ is treated as a nuisance parameter
$$K = \frac{1}{2}\big(y - f(x)\big)^2$$
• Since we want to minimize the error, we derive the negative gradient
$$-\frac{\partial K}{\partial f(x)} = -\frac{\partial}{\partial f(x)}\,\frac{1}{2}\big(y - f(x)\big)^2 = y - f(x) = r$$
• Assuming a Normal distribution with L2-loss, the negative gradients are just the residuals.
Maximizing the log-likelihood = minimizing the empirical risk function
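A quick numerical sanity check of this derivation (with arbitrary illustration values): the finite-difference gradient of the squared-error kernel matches the residuals.

```python
# For the squared-error kernel K = 0.5 * (y - f)^2, the negative gradient
# w.r.t. f equals the residual y - f. Values below are arbitrary.
import numpy as np

y, f = np.array([3.0, 1.5, -2.0]), np.array([2.0, 2.0, -1.0])

def kernel(f_val):
    return 0.5 * (y - f_val) ** 2

# Central finite-difference approximation of dK/df, element-wise
eps = 1e-6
grad = (kernel(f + eps) - kernel(f - eps)) / (2 * eps)

print("negative gradient:", -grad)   # approximately [1.0, -0.5, -1.0]
print("residuals y - f:  ", y - f)   # [1.0, -0.5, -1.0]
```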
48. XGBoost and Prediction Intervals
• XGBoost is based on Newton boosting (also called second-order gradient boosting or Hessian boosting).
• Problem: the gradient of the quantile loss is a step function, while the Hessian is zero everywhere and infinite at the origin.
[Figure: gradient (left) and Hessian (right) of the quantile loss]
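To make the problem concrete, the following sketch writes the pinball (quantile) loss in the form that XGBoost's custom-objective interface expects, i.e., per-observation gradient and Hessian; the quantile level alpha = 0.9 is an arbitrary illustration value.

```python
# Why quantile regression is awkward for Newton boosting: the pinball-loss
# gradient is a step function and its Hessian is zero almost everywhere.
# The function follows XGBoost's custom-objective convention
# (preds, DMatrix) -> (grad, hess).
import numpy as np

def quantile_objective(preds, dtrain, alpha=0.9):
    y = dtrain.get_label()
    error = y - preds
    # Gradient of the pinball loss w.r.t. the prediction: a step function
    grad = np.where(error > 0, -alpha, 1.0 - alpha)
    # Second derivative is 0 everywhere (undefined at error == 0), which
    # breaks the Newton step; a common workaround in practice is to use a
    # small positive constant instead.
    hess = np.zeros_like(preds)
    return grad, hess
```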
49. Connection between Distributional Modelling and XGBoost
• Newton boosting can be understood as an iterative empirical risk minimization procedure in function space.
• Empirical risk minimization and Maximum-Likelihood estimation are closely related:
• GAMLSS* are estimated using the 1st and 2nd order partial derivatives of the log-likelihood w.r.t. the distributional parameter $\vartheta_k$ of interest.
• By selecting an appropriate loss, or equivalently, a log-likelihood function, Maximum-Likelihood estimation can be formulated as empirical risk minimization.
• XGBoost can be interpreted as a statistical model.
*Generalized Additive Models for Location, Scale and Shape
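As an illustration of this connection, the following sketch computes the per-observation gradients and Hessians of the Gaussian negative log-likelihood with respect to mu and log(sigma), i.e., exactly the quantities a Newton-boosting step would consume. These are textbook derivatives written here for illustration, not code taken from the XGBoostLSS implementation.

```python
# Gradients and Hessians of the Gaussian negative log-likelihood w.r.t. the
# distributional parameters mu and log(sigma) (log link keeps sigma > 0).
import numpy as np

def gaussian_nll_derivatives(y, mu, log_sigma):
    sigma2 = np.exp(2 * log_sigma)
    # w.r.t. mu
    grad_mu = -(y - mu) / sigma2
    hess_mu = np.full_like(y, 1.0) / sigma2
    # w.r.t. log(sigma)
    grad_ls = 1.0 - (y - mu) ** 2 / sigma2
    hess_ls = 2.0 * (y - mu) ** 2 / sigma2
    return (grad_mu, hess_mu), (grad_ls, hess_ls)
```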
50. XGBoost - Estimation
XGBoost minimizes a regularized objective function
$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} \ell\big[y_i, \hat{y}_i^{(t)}\big] + \Omega(f_t) = \sum_{i=1}^{n} \ell\big[y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big] + \Omega(f_t)$$
A second-order approximation of $\ell[\cdot]$ yields
$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \Big[g_i\, f_t(x_i) + \frac{1}{2}\, h_i\, f_t^2(x_i)\Big] + \Omega(f_t)$$
with $g_i = \partial_{\hat{y}^{(t-1)}}\, \ell\big[y_i, \hat{y}_i^{(t-1)}\big]$ and $h_i = \partial^2_{\hat{y}^{(t-1)}}\, \ell\big[y_i, \hat{y}_i^{(t-1)}\big]$
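A tiny numerical illustration of this second-order expansion, using the squared-error loss and arbitrary numbers; for a quadratic loss the approximation is exact.

```python
# Expanding l(y, yhat + f_t) around yhat with g = dl/dyhat and h = d2l/dyhat2.
def loss(y, yhat):
    return 0.5 * (y - yhat) ** 2

y, yhat, f_t = 3.0, 2.2, 0.5           # current prediction and new tree output
g = -(y - yhat)                        # first derivative of the loss w.r.t. yhat
h = 1.0                                # second derivative of the loss w.r.t. yhat

exact = loss(y, yhat + f_t)
approx = loss(y, yhat) + g * f_t + 0.5 * h * f_t ** 2
print(exact, approx)                   # identical here, since the loss is quadratic
```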
53. Simulation Example
[Figure: XGBoost Regression - Simulated Data]
$$y \sim N\big(\mu = 10,\; \sigma = 1 + 4\cdot\mathbb{1}(0.3 < x < 0.5) + 2\cdot\mathbb{1}(x > 0.7)\big)$$
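For reference, data with the heteroscedastic standard deviation shown above could be generated as follows; the sample size and seed are assumptions, not stated on the slides.

```python
# Simulated data: constant mean, covariate-dependent standard deviation.
import numpy as np

rng = np.random.default_rng(0)
n = 7000
x = rng.uniform(size=n)
sigma = 1 + 4 * ((x > 0.3) & (x < 0.5)) + 2 * (x > 0.7)
y = rng.normal(loc=10, scale=sigma)
```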
54. Simulation Example
[Figure: Random Forest Regression - Simulated Data]
$$y \sim N\big(\mu = 10,\; \sigma = 1 + 4\cdot\mathbb{1}(0.3 < x < 0.5) + 2\cdot\mathbb{1}(x > 0.7)\big)$$
55. Simulation Example
Total Coverage: 89.2
Upper bound: 94.7
Lower bound: 5.5
[Figure: XGBoostLSS Regression - Simulated Data]
$$y \sim N\big(\mu = 10,\; \sigma = 1 + 4\cdot\mathbb{1}(0.3 < x < 0.5) + 2\cdot\mathbb{1}(x > 0.7)\big)$$
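One plausible reading of the coverage figures above, assuming the interval is bounded by the predicted 5% and 95% quantiles, is the share of test observations falling below each bound. A sketch of that computation, with placeholder array names:

```python
# Empirical coverage of a nominal 90% prediction interval, given arrays of
# predicted 5% and 95% quantiles (q05, q95) for each test observation.
import numpy as np

def coverage_stats(y_test, q05, q95):
    below_upper = np.mean(y_test <= q95) * 100                 # share below upper bound
    below_lower = np.mean(y_test <= q05) * 100                 # share below lower bound
    total = np.mean((y_test >= q05) & (y_test <= q95)) * 100   # total coverage
    return total, below_upper, below_lower
```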
68. XGBoostLSS Benefits
• Extends XGBoost to probabilistic forecasting.
• Valid uncertainty quantification of forecasts.
• More than 80 available distributions (continuous, discrete and mixed discrete-continuous).
• High interpretability of results.
• Compatible with all XGBoost implementations, i.e., R, Julia, Python, . . .
• Parallelized model training (currently CPU only).
• Scales to large datasets (> 1 million rows).
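A hypothetical usage sketch of the package linked on the next slide; the import paths, class names, and method signatures below are assumptions for illustration and may differ from the actual interface documented in the repository.

```python
# Hypothetical usage sketch only: module paths and method names are assumed,
# not verified against the XGBoostLSS repository.
import numpy as np
import xgboost as xgb
from xgboostlss.model import XGBoostLSS                  # assumed import path
from xgboostlss.distributions.Gaussian import Gaussian   # assumed import path

rng = np.random.default_rng(1)
X_train, y_train = rng.uniform(size=(1000, 3)), rng.normal(size=1000)
X_test = rng.uniform(size=(200, 3))

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test)

model = XGBoostLSS(Gaussian())                            # pick a response distribution
model.train({"eta": 0.1, "max_depth": 3}, dtrain, num_boost_round=100)

# Derive quantiles / prediction intervals from the predicted distribution
pred = model.predict(dtest, pred_type="quantiles", quantiles=[0.05, 0.95])
```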
69. Thanks for your attention!
https://github.com/StatMixedML/XGBoostLSS
März, Alexander (2019). XGBoostLSS - An extension of XGBoost to probabilistic forecasting.
https://arxiv.org/abs/1907.03178v4
70. To learn more about the meetup, follow the link:
https://www.meetup.com/Erlangen-Artificial-Intelligence-Machine-Learning-Meetup