Even in the era of Big Data there are many real-world problems where the number of input features is of the same order of magnitude as the number of samples. Often many of these input features are irrelevant, so inferring the relevant ones is an important problem in order to prevent over-fitting. Automatic Relevance Determination solves this problem by applying Bayesian techniques.
9. Prefer the model with high evidence for a given dataset.
Source: D. J. C. MacKay. Bayesian Interpolation. 1992
10. 1. Model fitting: Assume $\mathcal{H}_i$ is the right model and fit its parameters $w$ with Bayes' rule:
$$p(w \mid D, \mathcal{H}_i) = \frac{p(D \mid w, \mathcal{H}_i)\, p(w \mid \mathcal{H}_i)}{p(D \mid \mathcal{H}_i)}$$
"Business as usual"
2. Model comparison: Compare different models with the help of their evidence $p(D \mid \mathcal{H}_i)$ and model prior $p(\mathcal{H}_i)$:
$$p(\mathcal{H}_i \mid D) \propto p(D \mid \mathcal{H}_i)\, p(\mathcal{H}_i)$$
"Occam's razor at work"
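To make the model-comparison step concrete, here is a minimal sketch with a made-up coin-flip example (not from the slides): a fair coin $\mathcal{H}_0$ has no free parameters, while a biased coin $\mathcal{H}_1$ with a uniform prior on its bias pays an automatic complexity penalty through its evidence.

```python
from math import comb

# Toy model comparison (made-up data):
# H0: fair coin, no free parameters; H1: unknown bias theta, uniform prior.
n, k = 20, 15  # 20 flips, 15 heads

evidence_h0 = 0.5 ** n  # p(D | H0)

# p(D | H1) = integral_0^1 theta^k (1 - theta)^(n - k) dtheta
#           = 1 / ((n + 1) * C(n, k))   (Beta integral in closed form)
evidence_h1 = 1.0 / ((n + 1) * comb(n, k))

# With equal model priors, the posterior odds equal the evidence ratio:
# Occam's razor is built in, no extra penalty term is needed.
print(evidence_h1 / evidence_h0)  # ~3.2, so the data favour the biased coin
```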
13. Given:
• Dataset $D = \{(x_n, t_n)\}$ with $n = 1 \ldots N$
• Set of (non-linear) functions $\Phi = \{\varphi_h : x \mapsto \varphi_h(x)\}$ with $h = 1 \ldots H$

Assumption:
$$y(x; w) = \sum_{h=1}^{H} w_h \varphi_h(x), \qquad t_n = y(x_n; w) + \varepsilon_n,$$
where $\varepsilon_n$ is additive noise with $\varepsilon_n \sim \mathcal{N}(0, \alpha^{-1})$.

Task: Find $\min_w \lVert \Phi w - t \rVert^2$ (Ordinary Least Squares), where $\Phi$ also denotes the $N \times H$ design matrix with entries $\Phi_{nh} = \varphi_h(x_n)$.
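As a minimal sketch of this task (with made-up polynomial basis functions), the least-squares problem can be solved directly with NumPy:

```python
import numpy as np

# Made-up example: polynomial basis functions phi_h(x) = x^h.
rng = np.random.default_rng(0)
N, H = 50, 5
x = rng.uniform(-1, 1, size=N)
Phi = np.column_stack([x**h for h in range(1, H + 1)])  # N x H design matrix

w_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
t = Phi @ w_true + rng.normal(0, 0.1, size=N)  # t_n = y(x_n; w) + noise

# Ordinary least squares: argmin_w ||Phi w - t||^2
w_ols, *_ = np.linalg.lstsq(Phi, t, rcond=None)
```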
14. Problem: Having too many features leads to overfitting!

Regularization assumption: "weights are small", i.e.
$$p(w; \lambda) \sim \mathcal{N}(0, \lambda^{-1} I)$$

Task: Given $\alpha, \lambda$, find
$$\min_w\; \alpha \lVert \Phi w - t \rVert^2 + \lambda \lVert w \rVert^2$$
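Setting the gradient of this objective to zero gives the normal equations $(\lambda I + \alpha \Phi^T \Phi)\, w = \alpha \Phi^T t$; a minimal sketch:

```python
import numpy as np

def ridge_solution(Phi, t, alpha, lam):
    """Minimizer of alpha * ||Phi w - t||^2 + lam * ||w||^2 (a sketch)."""
    H = Phi.shape[1]
    # Normal equations: (lam * I + alpha * Phi^T Phi) w = alpha * Phi^T t
    return np.linalg.solve(lam * np.eye(H) + alpha * Phi.T @ Phi,
                           alpha * Phi.T @ t)
```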
15. Consider each $\alpha_i, \lambda_i$ as defining a model $\mathcal{H}_i(\alpha, \lambda)$. Yes! That means we can use our Bayesian Interpolation to find the $w, \alpha, \lambda$ with the highest evidence!

This is the idea behind BayesianRidge as found in sklearn.linear_model.
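A minimal usage sketch (reusing Phi and t from the earlier least-squares sketch):

```python
from sklearn.linear_model import BayesianRidge

# BayesianRidge maximizes the evidence over the noise precision alpha and a
# single shared weight precision lambda, then returns the posterior mean weights.
reg = BayesianRidge()
reg.fit(Phi, t)

print(reg.alpha_, reg.lambda_)  # evidence-maximizing hyperparameters
print(reg.coef_)                # posterior mean of the weights
```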
16. Consider that each weight has an individual variance, so that
$$p(w \mid \lambda) \sim \mathcal{N}(0, \Lambda^{-1}),$$
where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_H)$, $\lambda_h \in \mathbb{R}^+$. Now our minimization problem is:
$$\min_w\; \alpha \lVert \Phi w - t \rVert^2 + w^T \Lambda w$$

Pruning: If the precision $\lambda_h$ of feature $h$ is high, its weight $w_h$ is very likely to be close to zero and is therefore pruned.

This is called Sparse Bayesian Learning or Automatic Relevance Determination. Found as ARDRegression under sklearn.linear_model.
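A minimal usage sketch (again reusing Phi and t from above):

```python
from sklearn.linear_model import ARDRegression

# ARDRegression gives each weight its own precision lambda_h and prunes
# weights whose precision exceeds threshold_lambda (sklearn default: 1e4).
ard = ARDRegression(threshold_lambda=1e4)
ard.fit(Phi, t)

print(ard.coef_)    # sparse weight vector (pruned weights are exactly zero)
print(ard.lambda_)  # estimated per-feature precisions
```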
17. Cross-validation can be used for the estimation of hyperparameters, but it suffers from the curse of dimensionality (it is inappropriate for low-statistics settings).
Source: Peter Ellerton, http://pactiss.org/2011/11/02/bayesian-inference-homo-bayesianis/
18. • Random $100 \times 100$ design matrix $\Phi$ with 100 samples and 100 features
• Weights $w_i$, $i \in I = \{1, \ldots, 100\}$, with a random subset $J \subset I$, $|J| = 10$, and
$$w_i = \begin{cases} 0, & i \in I \setminus J \\ \mathcal{N}(w_i; 0, \tfrac{1}{4}), & i \in J \end{cases}$$
• Target $t = \Phi w + \varepsilon$ with random noise $\varepsilon_n \sim \mathcal{N}(0, \tfrac{1}{50})$

Task: Reconstruct the weights, especially the 10 non-zero weights!
Source: http://scikit-learn.org/stable/auto_examples/linear_model/plot_ard.html#example-linear-model-plot-ard-py
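A sketch of this synthetic problem (mirroring the linked scikit-learn example; the seed is arbitrary):

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(42)
n_samples, n_features = 100, 100
Phi = rng.standard_normal((n_samples, n_features))  # random 100 x 100 design matrix

w = np.zeros(n_features)
J = rng.choice(n_features, size=10, replace=False)  # random relevant subset, |J| = 10
w[J] = rng.normal(0, np.sqrt(1 / 4), size=10)       # w_i ~ N(0, 1/4) for i in J

t = Phi @ w + rng.normal(0, np.sqrt(1 / 50), size=n_samples)  # noise variance 1/50

# ARD should recover (approximately) the 10 non-zero weights:
ard = ARDRegression().fit(Phi, t)
print(np.flatnonzero(np.abs(ard.coef_) > 1e-3))  # compare with sorted(J)
```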
23. We have to determine the parameters $w, \lambda, \alpha$ for
$$p(w, \lambda, \alpha \mid t) = p(w \mid t, \lambda, \alpha)\; p(\lambda, \alpha \mid t).$$
1) Model fitting: For the first factor we have $p(w \mid t, \lambda, \alpha) \sim \mathcal{N}(\mu, \Sigma)$ with
$$\Sigma = (\Lambda + \alpha \Phi^T \Phi)^{-1}, \qquad \mu = \alpha \Sigma \Phi^T t.$$
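A direct sketch of these two formulas in NumPy:

```python
import numpy as np

def weight_posterior(Phi, t, lam, alpha):
    """Posterior N(mu, Sigma) of the weights for fixed per-weight
    precisions lam (shape (H,)) and noise precision alpha."""
    Sigma = np.linalg.inv(np.diag(lam) + alpha * Phi.T @ Phi)
    mu = alpha * Sigma @ Phi.T @ t
    return mu, Sigma
```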
24. 2) Model comparison: For the second factor we have
$$p(\lambda, \alpha \mid t) \propto p(t \mid \lambda, \alpha)\; p(\lambda)\; p(\alpha),$$
where $p(\lambda)$ and $p(\alpha)$ are hyperpriors, which we assume to be uniform. Using marginalization, we have
$$p(t \mid \lambda, \alpha) = \int p(t \mid w, \alpha)\; p(w \mid \lambda)\; dw,$$
i.e. the marginal likelihood or the "evidence for the hyperparameters".
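Since all factors are Gaussian, this integral has a closed form (see Tipping, 2001): $p(t \mid \lambda, \alpha) = \mathcal{N}(t; 0, C)$ with $C = \alpha^{-1} I + \Phi \Lambda^{-1} \Phi^T$. A sketch of the log evidence:

```python
import numpy as np

def log_evidence(Phi, t, lam, alpha):
    """log p(t | lam, alpha) for the Gaussian linear model, where
    C = alpha^{-1} I + Phi Lambda^{-1} Phi^T is the marginal covariance of t."""
    N = len(t)
    C = np.eye(N) / alpha + Phi @ np.diag(1.0 / lam) @ Phi.T
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + t @ np.linalg.solve(C, t))
```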
25. Differentiating the log marginal likelihood with respect to $\lambda_h$ and $\alpha$ and setting the derivatives to zero, we get
$$\lambda_h = \frac{\gamma_h}{\mu_h^2}, \qquad \alpha = \frac{N - \sum_h \gamma_h}{\lVert t - \Phi \mu \rVert^2},$$
with $\gamma_h = 1 - \lambda_h \Sigma_{hh}$. These formulae are used iteratively to find the maximum points $\lambda_{MP}$ and $\alpha_{MP}$.
26. 1. Starting values: $\alpha = \sigma^{-2}(t)$ (the inverse empirical variance of $t$), $\lambda = \mathbf{1}$
2. Calculate $\Sigma = (\Lambda + \alpha \Phi^T \Phi)^{-1}$ and $w = \mu = \alpha \Sigma \Phi^T t$
3. Update $\lambda_h = \gamma_h / \mu_h^2$ and $\alpha = (N - \sum_h \gamma_h) / \lVert t - \Phi \mu \rVert^2$, where $\gamma_h = 1 - \lambda_h \Sigma_{hh}$
4. Prune $\lambda_h$ and $w_h$ if $\lambda_h > \lambda_{\mathrm{threshold}}$
5. If not converged, go to 2. (A code sketch of this loop follows below.)
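A compact sketch of this loop in plain NumPy (a fixed iteration count stands in for a real convergence test, and the threshold value is an assumption):

```python
import numpy as np

def ard_fit(Phi, t, n_iter=300, lam_threshold=1e4):
    """Sketch of steps 1-5 above; n_iter and lam_threshold are assumed values."""
    N, H = Phi.shape
    alpha = 1.0 / np.var(t)        # step 1: alpha = sigma^{-2}(t), lambda = 1
    lam = np.ones(H)
    keep = np.ones(H, dtype=bool)  # mask of not-yet-pruned weights
    mu = np.zeros(H)
    for _ in range(n_iter):
        P = Phi[:, keep]
        # step 2: posterior of the remaining weights
        Sigma = np.linalg.inv(np.diag(lam[keep]) + alpha * P.T @ P)
        mu_k = alpha * Sigma @ P.T @ t
        # step 3: re-estimate the hyperparameters
        gamma = 1.0 - lam[keep] * np.diag(Sigma)
        lam[keep] = gamma / mu_k**2
        alpha = (N - gamma.sum()) / np.sum((t - P @ mu_k) ** 2)
        mu = np.zeros(H)
        mu[keep] = mu_k
        # step 4: prune weights whose precision has exploded
        keep &= lam < lam_threshold
    return mu, lam, alpha
```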
Sklearn implementation: The parameters $\alpha_1, \alpha_2$ as well as $\lambda_1, \lambda_2$ are the hyperprior parameters for $\alpha$ and $\lambda$, with
$$p(\alpha) \sim \Gamma(\alpha_1, \alpha_2^{-1}), \qquad p(\lambda_h) \sim \Gamma(\lambda_1, \lambda_2^{-1}),$$
where $\Gamma(\alpha, \beta^{-1})$ denotes a Gamma distribution with shape $\alpha$ and rate $\beta$, so that $E[\Gamma(\alpha, \beta^{-1})] = \frac{\alpha}{\beta}$ and $V[\Gamma(\alpha, \beta^{-1})] = \frac{\alpha}{\beta^2}$.
27. Given some new data $x_*$, a prediction for $t_*$ is made by
$$p(t_* \mid t, \lambda_{MP}, \alpha_{MP}) = \int p(t_* \mid w, \alpha_{MP})\; p(w \mid t, \lambda_{MP}, \alpha_{MP})\; dw = \mathcal{N}\!\left(\mu^T \varphi(x_*),\; \alpha_{MP}^{-1} + \varphi(x_*)^T \Sigma\, \varphi(x_*)\right).$$
This is a good approximation of the full predictive distribution
$$p(t_* \mid t) = \iiint p(t_* \mid w, \lambda, \alpha)\; p(w, \lambda, \alpha \mid t)\; dw\; d\lambda\; d\alpha.$$
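A sketch of this prediction from the fitted quantities ($\mu$, $\Sigma$, $\alpha_{MP}$ as computed above; phi_star is the basis-function vector of the new input). In scikit-learn, the same mean and standard deviation are returned by predict(X, return_std=True) on a fitted BayesianRidge or ARDRegression model.

```python
import numpy as np

def predict(phi_star, mu, Sigma, alpha):
    """Predictive mean and standard deviation for a new basis vector phi_star."""
    mean = mu @ phi_star
    var = 1.0 / alpha + phi_star @ Sigma @ phi_star  # noise + weight uncertainty
    return mean, np.sqrt(var)
```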
28. 1. D. J. C. MacKay. Bayesian Interpolation. 1992 (… to understand the overall idea)
2. M. E. Tipping. Sparse Bayesian Learning and the Relevance Vector Machine. June 2001 (… to understand the ARD algorithm)
3. T. Fletcher. Relevance Vector Machines Explained. October 2010 (… to understand the ARD algorithm in detail)
4. D. Wipf. A New View of Automatic Relevance Determination. 2008 (… not as good as the ones above)

Graphs on slides 7 and 9 were taken from [1], and the awesome tutorials of Scikit-Learn were consulted many times.