Slides by Alexander März:
The language of statistics is probabilistic in nature. Any model that falls short of quantifying the uncertainty attached to its outcome is likely to provide an incomplete and potentially misleading picture. While this is a firmly established consensus in statistics, machine learning approaches usually lack proper ways of quantifying uncertainty. In fact, a possible distinction between the two modelling cultures can be attributed to the (non-)existence of uncertainty estimates that allow for, e.g., hypothesis testing or the construction of estimation/prediction intervals. Quantification of uncertainty in general, and probabilistic forecasting in particular, does not just provide an average point forecast; rather, it equips the user with a range of outcomes and the probability of each of them occurring.
In an effort to bring both disciplines closer together, the audience is introduced to a new framework for XGBoost that predicts the entire conditional distribution of a univariate response variable. In particular, XGBoostLSS models all moments of a parametric distribution (i.e., mean, location, scale and shape [LSS]) instead of the conditional mean only. Choosing from a wide range of continuous, discrete and mixed discrete-continuous distributions, modelling and predicting the entire conditional distribution greatly enhances the flexibility of XGBoost, as it allows us to gain additional insight into the data-generating process and to create probabilistic forecasts from which prediction intervals and quantiles of interest can be derived. As such, XGBoostLSS contributes to the growing literature on statistical machine learning that aims at weakening the separation between Breiman's "Data Modelling Culture" and "Algorithmic Modelling Culture", so that models designed mainly for prediction can also be used to describe and explain the underlying data-generating process of the response of interest.
2. XGBoostLSS
An extension of XGBoost to probabilistic forecasting
Dr. Alexander März
Erlangen Artificial Intelligence & Machine Learning Meetup
November 14, 2019
3. Table of Contents
1. Embracing Uncertainty
2. Distributional Modelling
3. Gradient Boosting Re-visited
4. XGBoostLSS
Estimation
Simulation Example
Real World Example: Modelling Munich Rents
6. What does ML mean for you?
ML Framework
Maximum Likelihood
• Inference
• Quantification of uncertainty (Fisher Information Matrix)
Machine Learning
• Modelling ≈ function optimization
• Focus on prediction accuracy
8. Fundamental Principle 1
To reason rigorously under uncertainty, we need to invoke the language of probability and statistics!
*Zhang, A. et al. (2019). Dive into Deep Learning. http://d2l.ai/index.html
15. Embracing Uncertainty: Probabilistic Forecasting
Traditional point-forecasting approaches are expected to produce correct figures. Naturally, however, the future is uncertain. Instead of taking only one possible future into account, probabilistic forecasts assign probabilities to different outcomes.
• Probabilistic forecasts provide a realistic way of looking at the future:
• instead of hoping that point forecasts materialize, probabilistic forecasts remind you that everything is possible, just not quite equally probable.
*https://www.lokad.com/probabilistic-forecasting
16. Embracing Uncertainty: Probabilistic Forecasting
Probabilistic forecasts are predictions in the form of a probability distribution, rather than a single point estimate.
17. Embracing Uncertainty: Probabilistic Forecasting
Old: What is the average value of an outcome, given the features?
New: What are the probabilities of an outcome, given the features?
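To make the contrast concrete, here is a minimal Python sketch (with made-up numbers, not taken from the slides): a point forecast answers the old question, while a predictive distribution answers the new one.

```python
# Illustrative only: contrast a point forecast with a probabilistic forecast.
# All numbers below are invented for the sake of the example.
from scipy.stats import norm

# Old: a single point forecast, e.g., an expected outcome of 100 units.
point_forecast = 100.0

# New: a full predictive distribution, here a Normal with predicted
# mean 100 and predicted standard deviation 15.
predictive_dist = norm(loc=100.0, scale=15.0)

# The distribution answers probabilistic questions a point forecast cannot:
prob_above_120 = 1 - predictive_dist.cdf(120)     # P(outcome > 120)
interval_90 = predictive_dist.ppf([0.05, 0.95])   # central 90% prediction interval

print(f"Point forecast: {point_forecast}")
print(f"P(Y > 120) = {prob_above_120:.3f}")
print(f"90% prediction interval: [{interval_90[0]:.1f}, {interval_90[1]:.1f}]")
```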
20. Distributional Modelling
The ultimate goal of regression analysis is to obtain information about the [entire] conditional distribution of a response given a set of explanatory variables.
• The focus of machine learning is mainly on modelling $E(Y \mid X = x) = f(x)$, e.g., splitting procedures in CART favour the detection of changes in the mean.
• In general, they have very low power for detecting other patterns (e.g., changes in variance) even if these can be related to covariates.
• Information about the entire conditional distribution $P(Y \le y \mid X = x) = F_Y(y \mid x)$ is not available.
*Hothorn, T. et al. (2014). Conditional transformation models. JRSS: Series B (Statistical Methodology) 76(1), 3–27.
22. Distributional Modelling
Relate all distributional parameters to explanatory variables:
$$y_i \overset{ind}{\sim} \mathcal{D}\Big(h_1(\vartheta_{i1}) = \eta_{i1},\, h_2(\vartheta_{i2}) = \eta_{i2},\, \ldots,\, h_K(\vartheta_{iK}) = \eta_{iK}\Big), \quad i = 1, \ldots, n$$
with a flexible predictor
$$\eta_k = f_k(x), \quad k = 1, \ldots, K$$
where $f_k(\cdot)$ can take on several forms:
• $f_k(x) = X_k \beta_k + \sum_{j=1}^{p_k} f_{k,j}(x_j)$
• $f_k(x)$ = Random Forest, Gradient Boosting Trees, Neural Network, . . .
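A minimal numpy sketch of this idea, assuming a Normal distribution with the two parameters mu and sigma; the predictor functions and coefficients below are hypothetical and only illustrate how link functions map the predictors eta_k to valid parameter values.

```python
# A minimal sketch of distributional modelling, assuming a Normal response
# with K = 2 parameters (mu, sigma). The predictors eta_mu() and eta_sigma()
# are hypothetical placeholders; in practice they could be linear terms,
# splines, boosted trees, a neural network, etc.
import numpy as np

rng = np.random.default_rng(123)
x = rng.uniform(size=1000)

def eta_mu(x):      # predictor for the location parameter
    return 10.0 + 2.0 * x

def eta_sigma(x):   # predictor for the scale parameter
    return -0.5 + 1.5 * x

# Inverse link functions map the predictors to valid parameter values:
mu = eta_mu(x)                 # identity link for mu
sigma = np.exp(eta_sigma(x))   # log link keeps sigma positive

# Every observation gets its own conditional distribution D(mu_i, sigma_i).
y = rng.normal(loc=mu, scale=sigma)
```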
23. Distributional Modelling
• Restrictive assumption of strong stationarity in (time series) modelling:
$$y \overset{iid}{\sim} \mathcal{D}\big(\mu(x), \vartheta\big)$$
$$F_Y(y_{t_1}, \ldots, y_{t_n}) = F_Y(y_{t_1+\tau}, \ldots, y_{t_n+\tau}), \quad \forall\, n, t_1, \ldots, t_n, \tau$$
• As all distributional parameters are functions of covariates, we shift the entire distribution forward in time:
$$y \overset{ind}{\sim} \mathcal{D}\big(\vartheta(x)\big)$$
31. Gradient Descent Boosting
Sketch of Algorithm
1. Set $\hat{f}(x) = 0$ and $r_i = y_i$ for all $i = 1, \ldots, N$ in the training set
2. For $t = 1, 2, \ldots, T$, repeat:
(a) Fit a tree $\hat{f}^{t}$ to the training data $(X, r)$
(b) Update $\hat{f}$ by adding a shrunken version of the new estimate
$$\hat{f}^{(t)}(x) = \hat{f}^{(t-1)}(x) + \eta\, \hat{f}^{t}(x)$$
(c) Update the residuals
$$r_i = r_i - \eta\, \hat{f}^{t}(x_i)$$
3. Output the boosted model
$$\hat{f}(x) = \sum_{t=1}^{T} \eta\, \hat{f}^{t}(x)$$
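The algorithm sketch above translates almost line by line into code. The following is a minimal Python implementation using shallow regression trees from scikit-learn as base learners; the simulated data, tree depth and learning rate are illustrative choices, not taken from the slides.

```python
# Direct translation of the algorithm sketch: boosting shallow regression
# trees on residuals with a shrinkage factor eta.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(size=(500, 1))
y = np.sin(6 * X[:, 0]) + rng.normal(scale=0.3, size=500)

T, eta = 100, 0.1            # number of boosting rounds and learning rate
trees = []
r = y.copy()                 # 1. f_hat(x) = 0, residuals r_i = y_i

for t in range(T):           # 2. for t = 1, ..., T
    tree = DecisionTreeRegressor(max_depth=2).fit(X, r)  # (a) fit tree to (X, r)
    trees.append(tree)
    r -= eta * tree.predict(X)                           # (b)+(c) shrunken update

def f_hat(X_new):
    """3. Boosted model: sum of shrunken trees."""
    return eta * sum(tree.predict(X_new) for tree in trees)

print("Training MSE:", np.mean((y - f_hat(X)) ** 2))
```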
32. Gradient Descent Boosting
• Boosting iteratively fits a tree to the residuals from the previous model.
• To see the connection between residuals and the gradient, we need to go back to Maximum Likelihood.
• In the following, we assume the data are generated as
$$y = f(x) + \epsilon$$
• Since most Machine Learning models focus on the conditional mean only, we have
$$E(y \mid x) = f(x)$$
33. Gradient Descent Boosting
• Most boosting models implicitly assume a Normal distribution as a default loss function, with all higher moments being constant.
• For a Gaussian $N\big(f(x), \sigma^2\big)$ with density
$$f\big(y \mid f(x), \sigma^2\big) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{\big(y_i - f(x_i)\big)^2}{2\sigma^2}\right)$$
the corresponding log-likelihood is
$$\log L\big(f(x), \sigma\big) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2$$
34. Gradient Descent Boosting
• We focus on the kernel of the Normal, where $\sigma^2$ is treated as a nuisance parameter
$$K = \frac{1}{2}\big(y - f(x)\big)^2$$
• Since we want to minimize the error, we derive the negative gradient
$$-\frac{\partial K}{\partial f(x)} = -\frac{\partial}{\partial f(x)}\,\frac{1}{2}\big(y - f(x)\big)^2 = y - f(x) = r$$
• Assuming a Normal distribution with L2-loss, the negative gradients are just the residuals.
Maximizing the log-likelihood = minimizing the empirical risk function
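A quick numerical sanity check of this derivation (with arbitrary illustration values): the finite-difference gradient of the squared-error kernel matches the residuals.

```python
# For the squared-error kernel K = 0.5 * (y - f)^2, the negative gradient
# w.r.t. f equals the residual y - f. Values below are arbitrary.
import numpy as np

y, f = np.array([3.0, 1.5, -2.0]), np.array([2.0, 2.0, -1.0])

def kernel(f_val):
    return 0.5 * (y - f_val) ** 2

# Central finite-difference approximation of dK/df, element-wise
eps = 1e-6
grad = (kernel(f + eps) - kernel(f - eps)) / (2 * eps)

print("negative gradient:", -grad)   # approximately [1.0, -0.5, -1.0]
print("residuals y - f:  ", y - f)   # [1.0, -0.5, -1.0]
```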
48. XGBoost and Prediction Intervals
• XGBoost is based on Newton boosting (also called second-order gradient boosting or Hessian boosting).
• Problem: the gradient of the quantile loss is a step function, while the Hessian is zero everywhere and infinite at the origin.
[Figure: gradient (left) and Hessian (right) of the quantile loss]
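To make the problem concrete, the following sketch writes the pinball (quantile) loss in the form that XGBoost's custom-objective interface expects, i.e., per-observation gradient and Hessian; the quantile level alpha = 0.9 is an arbitrary illustration value.

```python
# Why quantile regression is awkward for Newton boosting: the pinball-loss
# gradient is a step function and its Hessian is zero almost everywhere.
# The function follows XGBoost's custom-objective convention
# (preds, DMatrix) -> (grad, hess).
import numpy as np

def quantile_objective(preds, dtrain, alpha=0.9):
    y = dtrain.get_label()
    error = y - preds
    # Gradient of the pinball loss w.r.t. the prediction: a step function
    grad = np.where(error > 0, -alpha, 1.0 - alpha)
    # Second derivative is 0 everywhere (undefined at error == 0), which
    # breaks the Newton step; a common workaround in practice is to use a
    # small positive constant instead.
    hess = np.zeros_like(preds)
    return grad, hess
```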
49. Connection between Distributional Modelling and XGBoost
• Newton boosting can be understood as an iterative empirical risk minimization procedure in function space.
• Empirical risk minimization and Maximum-Likelihood estimation are closely related:
• GAMLSS* are estimated using the 1st and 2nd order partial derivatives of the log-likelihood w.r.t. the distributional parameter $\vartheta_k$ of interest.
• By selecting an appropriate loss, or equivalently, a log-likelihood function, Maximum-Likelihood estimation can be formulated as empirical risk minimization.
• XGBoost can be interpreted as a statistical model.
*Generalized Additive Models for Location, Scale and Shape
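As an illustration of this connection, the following sketch computes the per-observation gradients and Hessians of the Gaussian negative log-likelihood with respect to mu and log(sigma), i.e., exactly the quantities a Newton-boosting step would consume. These are textbook derivatives written here for illustration, not code taken from the XGBoostLSS implementation.

```python
# Gradients and Hessians of the Gaussian negative log-likelihood w.r.t. the
# distributional parameters mu and log(sigma) (log link keeps sigma > 0).
import numpy as np

def gaussian_nll_derivatives(y, mu, log_sigma):
    sigma2 = np.exp(2 * log_sigma)
    # w.r.t. mu
    grad_mu = -(y - mu) / sigma2
    hess_mu = np.full_like(y, 1.0) / sigma2
    # w.r.t. log(sigma)
    grad_ls = 1.0 - (y - mu) ** 2 / sigma2
    hess_ls = 2.0 * (y - mu) ** 2 / sigma2
    return (grad_mu, hess_mu), (grad_ls, hess_ls)
```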
50. XGBoost - Estimation
XGBoost minimizes a regularized objective function
$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} \ell\big[y_i, \hat{y}_i^{(t)}\big] + \Omega(f_t) = \sum_{i=1}^{n} \ell\big[y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big] + \Omega(f_t)$$
A second-order approximation of $\ell[\cdot]$ yields
$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \Big[g_i\, f_t(x_i) + \frac{1}{2}\, h_i\, f_t^2(x_i)\Big] + \Omega(f_t)$$
with $g_i = \partial_{\hat{y}^{(t-1)}}\, \ell\big[y_i, \hat{y}_i^{(t-1)}\big]$ and $h_i = \partial^2_{\hat{y}^{(t-1)}}\, \ell\big[y_i, \hat{y}_i^{(t-1)}\big]$
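A tiny numerical illustration of this second-order expansion, using the squared-error loss and arbitrary numbers; for a quadratic loss the approximation is exact.

```python
# Expanding l(y, yhat + f_t) around yhat with g = dl/dyhat and h = d2l/dyhat2.
def loss(y, yhat):
    return 0.5 * (y - yhat) ** 2

y, yhat, f_t = 3.0, 2.2, 0.5           # current prediction and new tree output
g = -(y - yhat)                        # first derivative of the loss w.r.t. yhat
h = 1.0                                # second derivative of the loss w.r.t. yhat

exact = loss(y, yhat + f_t)
approx = loss(y, yhat) + g * f_t + 0.5 * h * f_t ** 2
print(exact, approx)                   # identical here, since the loss is quadratic
```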
53. Simulation Example
[Figure: XGBoost Regression - Simulated Data]
$$y \sim N\big(\mu = 10,\; \sigma = 1 + 4\cdot\mathbb{1}(0.3 < x < 0.5) + 2\cdot\mathbb{1}(x > 0.7)\big)$$
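For reference, data with the heteroscedastic standard deviation shown above could be generated as follows; the sample size and seed are assumptions, not stated on the slides.

```python
# Simulated data: constant mean, covariate-dependent standard deviation.
import numpy as np

rng = np.random.default_rng(0)
n = 7000
x = rng.uniform(size=n)
sigma = 1 + 4 * ((x > 0.3) & (x < 0.5)) + 2 * (x > 0.7)
y = rng.normal(loc=10, scale=sigma)
```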
54. Simulation Example
[Figure: Random Forest Regression - Simulated Data]
$$y \sim N\big(\mu = 10,\; \sigma = 1 + 4\cdot\mathbb{1}(0.3 < x < 0.5) + 2\cdot\mathbb{1}(x > 0.7)\big)$$
55. Simulation Example
Total Coverage: 89.2
Upper bound: 94.7
Lower bound: 5.5
[Figure: XGBoostLSS Regression - Simulated Data]
$$y \sim N\big(\mu = 10,\; \sigma = 1 + 4\cdot\mathbb{1}(0.3 < x < 0.5) + 2\cdot\mathbb{1}(x > 0.7)\big)$$
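One plausible reading of the coverage figures above, assuming the interval is bounded by the predicted 5% and 95% quantiles, is the share of test observations falling below each bound. A sketch of that computation, with placeholder array names:

```python
# Empirical coverage of a nominal 90% prediction interval, given arrays of
# predicted 5% and 95% quantiles (q05, q95) for each test observation.
import numpy as np

def coverage_stats(y_test, q05, q95):
    below_upper = np.mean(y_test <= q95) * 100                 # share below upper bound
    below_lower = np.mean(y_test <= q05) * 100                 # share below lower bound
    total = np.mean((y_test >= q05) & (y_test <= q95)) * 100   # total coverage
    return total, below_upper, below_lower
```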
68. XGBoostLSS Benefits
• Extends XGBoost to probabilistic forecasting.
• Valid uncertainty quantification of forecasts.
• More than 80 available distributions (continuous, discrete and mixed discrete-continuous).
• High interpretability of results.
• Compatible with all XGBoost implementations, i.e., R, Julia, Python, . . .
• Parallelized model training (currently CPU only).
• Scales to large datasets (> 1 million rows).
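A hypothetical usage sketch of the package linked on the next slide; the import paths, class names, and method signatures below are assumptions for illustration and may differ from the actual interface documented in the repository.

```python
# Hypothetical usage sketch only: module paths and method names are assumed,
# not verified against the XGBoostLSS repository.
import numpy as np
import xgboost as xgb
from xgboostlss.model import XGBoostLSS                  # assumed import path
from xgboostlss.distributions.Gaussian import Gaussian   # assumed import path

rng = np.random.default_rng(1)
X_train, y_train = rng.uniform(size=(1000, 3)), rng.normal(size=1000)
X_test = rng.uniform(size=(200, 3))

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test)

model = XGBoostLSS(Gaussian())                            # pick a response distribution
model.train({"eta": 0.1, "max_depth": 3}, dtrain, num_boost_round=100)

# Derive quantiles / prediction intervals from the predicted distribution
pred = model.predict(dtest, pred_type="quantiles", quantiles=[0.05, 0.95])
```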
69. Thanks for your attention!
https://github.com/StatMixedML/XGBoostLSS
März, Alexander (2019). XGBoostLSS - An extension of XGBoost to probabilistic forecasting.
https://arxiv.org/abs/1907.03178v4
70. To learn more about the meetup, follow the link:
https://www.meetup.com/Erlangen-Artificial-Intelligence-Machine-Learning-Meetup