10. How will we spend the next 60 minutes?
In thinking about the following topics:
1. What does “probabilistic modeling” mean?
2. Why is it cool (sometimes)?
3. How we can use it to build:
   a. More robust and powerful models
   b. Models with predefined properties
   c. Models without overfitting (o_O)
   d. Infinite ensembles of models (o_O)
4. Deep Learning
13. Problem statement: Empirical way
Suppose that we want to solve a classical regression problem.
Typical approach:
1. Choose a functional family for F(...)
2. Choose an appropriate loss function
3. Choose an optimization algorithm
4. Minimize the loss on (X, Y)
5. ...
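The four steps above can be sketched end to end with a minimal example, assuming a linear functional family, an MSE loss, and plain gradient descent on a synthetic 1-D dataset (all of these choices are illustrative):

```python
import numpy as np

# Synthetic data: y = 2x + 0.5 + noise (illustrative generating process).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
Y = 2.0 * X + 0.5 + rng.normal(0, 0.1, size=100)

# 1. Functional family: F(x; w, b) = w * x + b
# 2. Loss: mean squared error
# 3. Optimizer: plain gradient descent
w, b = 0.0, 0.0
lr = 0.1
for _ in range(500):
    err = (w * X + b) - Y
    w -= lr * 2 * np.mean(err * X)  # dMSE/dw
    b -= lr * 2 * np.mean(err)      # dMSE/db

# 4. After minimization, (w, b) is close to the generating values (2.0, 0.5).
```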
18. Problem statement: Probabilistic way
Define a “probability model” (it describes how your data was generated):
Having the model, you can calculate the “likelihood” of your data:
Here we are working with i.i.d. data sharing the same variance.
21. Problem statement: Probabilistic way
Data log-likelihood:
Maximum likelihood estimation:
MSE loss minimization (for i.i.d. Gaussian data sharing the same variance!)
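The "MLE = MSE minimization" claim can be checked numerically: for i.i.d. Gaussian noise with a shared variance, the negative log-likelihood is an affine function of the MSE, so the two objectives share a minimizer. The residuals and sigma below are illustrative:

```python
import numpy as np

# Toy targets, predictions, and noise std (all illustrative).
rng = np.random.default_rng(1)
y = rng.normal(size=50)
pred = rng.normal(size=50)
sigma = 0.7

resid = y - pred
mse = np.mean(resid ** 2)
# Gaussian negative log-likelihood, summed over i.i.d. points:
nll = np.sum(0.5 * np.log(2 * np.pi * sigma ** 2) + resid ** 2 / (2 * sigma ** 2))

# NLL = n/2 * log(2*pi*sigma^2) + n * MSE / (2*sigma^2): constant + scaled MSE,
# so minimizing NLL over the predictions is minimizing MSE.
n = len(y)
reconstructed = n / 2 * np.log(2 * np.pi * sigma ** 2) + n * mse / (2 * sigma ** 2)
assert np.isclose(nll, reconstructed)
```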
27. Problem statement: Probabilistic way
Empirical loss minimization = log-likelihood maximization
1. MAE minimization = likelihood maximization of i.i.d. Laplace-distributed variables
2. For every empirically stated problem there exists an appropriate probability model
3. An empirical loss is often just a particular case of a wider probability model
4. A wider model = wider opportunities!
28. Probabilistic modeling: Wider opportunities for Flo
Suppose that we have:
1. N unique users in the training set
2. For each user, a collected time series of user states (on a daily basis)
3. For each user, a collected time series of cycle lengths
4. We predict the time series of lengths Y based on the time series of states X
36. Probabilistic modeling: Wider opportunities for Flo
We want to maximize the data likelihood, i.e. (in another notation) the probability that user i will have a cycle of length y at day j:
- The cycle length of user i at day j has a Gaussian distribution
- The parameters of the distribution at day j depend on the model parameters and on all features up to day j
- This can be easily modeled with a deep RNN!
Note that we don’t need any labels to predict the variance!
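The "no variance labels" point can be made concrete with the loss itself: if the network outputs a mean and a log-variance per step and we minimize the Gaussian negative log-likelihood, the optimal variance is pushed toward the empirical squared error by the likelihood alone. The toy numbers below are illustrative; the actual model in the talk is a deep RNN:

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """Negative log-likelihood of y under N(mu, exp(log_var)), up to a constant.

    A network trained to output (mu, log_var) and minimize this loss learns
    the variance without any variance labels: log_var is penalized purely
    through the likelihood term.
    """
    return np.mean(0.5 * log_var + 0.5 * (y - mu) ** 2 / np.exp(log_var))

# Toy check: for fixed residuals, the loss over log_var is minimized when
# exp(log_var) equals the empirical mean squared error.
y = np.array([1.0, 2.0, 3.0])
mu = np.array([1.1, 1.8, 3.2])
best = np.log(np.mean((y - mu) ** 2))
grid = np.linspace(best - 2, best + 2, 401)
losses = [gaussian_nll(y, mu, np.full(3, lv)) for lv in grid]
assert abs(grid[int(np.argmin(losses))] - best) < 0.02
```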
50. Maximum a posteriori estimator
Until now, we’ve been talking about the Maximum Likelihood Estimator:
Now assume that a prior distribution over the parameters exists:
Then we can apply Bayes’ rule:
- Left-hand side: the posterior distribution over model parameters
- Numerator: the data likelihood for specific parameters (can be modeled with a deep network!) times the prior distribution over parameters (describes our prior knowledge and/or our desires for the model)
- Denominator: the Bayesian evidence, a powerful method for model selection; as a rule, this integral is intractable :( (you can almost never integrate it)
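The terms called out above are the pieces of Bayes' rule for the model parameters θ given data (X, Y):

```latex
% Posterior \propto likelihood \times prior; the denominator is the evidence.
p(\theta \mid X, Y)
  = \frac{p(Y \mid X, \theta)\, p(\theta)}
         {\int p(Y \mid X, \theta)\, p(\theta)\, d\theta}
```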
58. Maximum a posteriori estimator
The core idea of the Maximum a Posteriori Estimator (the only, but powerful, difference from MLE):
1. MAP estimates the model parameters as the mode of the posterior distribution
2. MAP estimation with a non-informative prior = MLE
3. MAP restricts the search space of possible models
4. With MAP you can put restrictions not only on model weights but also on many interactions inside the network
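A numeric sanity check for points 2 and 3: the MAP objective is the MLE objective plus log p(θ), and with a zero-mean Gaussian prior that extra term is exactly an L2 penalty (up to a constant), which shrinks the admissible weights. The weights and prior std below are illustrative:

```python
import numpy as np

w = np.array([0.5, -1.2, 2.0])  # illustrative weights
tau = 1.0                       # prior std

# Log-density of an i.i.d. N(0, tau^2) prior over the weights.
log_prior = np.sum(-0.5 * np.log(2 * np.pi * tau ** 2) - w ** 2 / (2 * tau ** 2))

# -log p(w) = L2 penalty + constant  =>  MAP with a Gaussian prior = L2 regularization.
l2_penalty = np.sum(w ** 2) / (2 * tau ** 2)
const = -0.5 * len(w) * np.log(2 * np.pi * tau ** 2)
assert np.isclose(-log_prior, l2_penalty - const)
```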
78. Probabilistic modeling: Regularization
1. Laplace distribution as a prior = L1 regularization
2. It can be shown that Dropout is also a form of a particular probability model …
3. … a Bayesian one :) …
4. … and therefore can be used not only as a regularization technique!
5. Do you want to pack your network weights into a few kilobytes?
6. OK, all you need is MAP!
MAP is all you need!
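Point 1 can be verified in a couple of lines: the negative log-density of an i.i.d. Laplace(0, b) prior over the weights is an L1 penalty plus a constant. The weights and scale below are illustrative:

```python
import numpy as np

w = np.array([0.3, -0.7, 1.5])  # illustrative weights
b = 1.0                         # Laplace scale (illustrative)

# Laplace(0, b) density: p(w) = exp(-|w|/b) / (2b), so
# -log p(w) = |w|/b + log(2b): an L1 penalty plus a constant.
neg_log_prior = np.sum(np.log(2 * b) + np.abs(w) / b)
l1_penalty = np.sum(np.abs(w)) / b
assert np.isclose(neg_log_prior, l1_penalty + len(w) * np.log(2 * b))
```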
79. Weights packing: Empirical way
Song Han et al. - Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding (2015)
Modern neural networks can be dramatically compressed:
84. Weights packing: Soft-Weight Sharing
1. Define the prior distribution of the weights as a Gaussian Mixture Model
2. For one of the Gaussian components, force:
3. Maybe define a Gamma prior for the variances (for numerical stability)
4. Just find the MAP estimate for both the model parameters and the free mixture parameters!
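A minimal sketch of the soft-weight-sharing penalty from the steps above: the prior over each weight is a mixture of Gaussians with one component pinned at zero, and its negative log-density is added to the data loss in the MAP objective. The mixture parameters below are illustrative, not the tuned values from the paper:

```python
import numpy as np

def log_gmm_prior(w, pis, mus, sigmas):
    # log p(w_i) = logsumexp over mixture components, summed over all weights.
    w = w[:, None]
    log_comp = (np.log(pis)
                - 0.5 * np.log(2 * np.pi * sigmas ** 2)
                - (w - mus) ** 2 / (2 * sigmas ** 2))
    m = log_comp.max(axis=1, keepdims=True)  # stabilized logsumexp
    return np.sum(m.squeeze(1) + np.log(np.exp(log_comp - m).sum(axis=1)))

pis = np.array([0.9, 0.05, 0.05])   # most mass on the zero component
mus = np.array([0.0, -0.4, 0.4])    # one component pinned at mu = 0
sigmas = np.array([0.05, 0.1, 0.1])

w = np.array([0.01, -0.39, 0.41, 0.0])          # illustrative network weights
penalty = -log_gmm_prior(w, pis, mus, sigmas)   # added to the data loss for MAP
assert np.isfinite(penalty)
# The prior rewards weights near the component centers (near-zero weights most).
assert log_gmm_prior(np.array([0.0]), pis, mus, sigmas) > \
       log_gmm_prior(np.array([1.0]), pis, mus, sigmas)
```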
87. Maximum a posteriori estimation
1. A pretty cool and powerful technique
2. You can build hierarchical models (put priors on priors of priors of…)
3. You can put priors on the activations of layers (sparse autoencoders)
4. Leads to “Empirical Bayes”
5. Thinking about how to restrict your model? Try to find an appropriate prior!
92. True Bayesian Modeling: Recap
1. The posterior can easily be found in the case of conjugate distributions
2. But for most real-life models the denominator is intractable
3. In MAP, the denominator is ignored entirely
4. Can we find a good approximation of the posterior?
98. True Bayesian Modeling: Approximation
Two main ideas:
1. MCMC (Markov Chain Monte Carlo) - a tricky one
2. Variational Inference - a “black magic” one
Other ideas exist:
1. Monte Carlo Dropout
2. Stochastic Gradient Langevin Dynamics
3. ...
105. True Bayesian Modeling: MCMC
1. The key idea is to construct a Markov chain that has the posterior distribution as its equilibrium distribution
2. Then you burn in the Markov chain (let it converge to equilibrium) and sample from the posterior distribution
3. Sounds tricky, but it is a well-defined procedure
4. PyMC3 = Bayesian Modeling and Probabilistic Machine Learning in Python
5. Unfortunately, it is not scalable
6. So you can’t apply it directly to complex models (like neural networks)
7. But implicit scaling is possible: Bayesian Learning via Stochastic Gradient Langevin Dynamics (2011)
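The method in point 7 can be sketched on a toy target: SGLD takes a scaled gradient step on the log-posterior and adds Gaussian noise whose variance equals the step size. The 1-D target posterior here is illustrative; in a real model the gradient would be estimated from minibatches (that is the "stochastic gradient" part):

```python
import numpy as np

# Toy target posterior: N(2, 0.5^2).
rng = np.random.default_rng(0)
mu_post, sigma_post = 2.0, 0.5

def grad_log_post(theta):
    return -(theta - mu_post) / sigma_post ** 2

theta, eps = 0.0, 0.01  # step size is illustrative; the paper anneals it
samples = []
for t in range(5000):
    # Langevin update: half a gradient step plus N(0, eps) noise.
    theta += 0.5 * eps * grad_log_post(theta) + rng.normal(0.0, np.sqrt(eps))
    if t > 1000:  # discard burn-in draws
        samples.append(theta)
samples = np.array(samples)
# After burn-in, the chain's empirical mean/std approximate (2.0, 0.5).
```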
161. Bayesian Networks: Step by step
1. Define a functional family for the approximate posterior (e.g. Gaussian):
2. Solve the optimization problem (with doubly stochastic gradient ascent):
3. Having the approximate posterior, you can sample network weights (as many draws as you want)!
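The sampling step above can be sketched with the usual reparameterization trick: once a factorized Gaussian approximate posterior q(w) is fitted, each draw w = μ + σ·ε is one network from the infinite ensemble. The variational parameters below are made up, as if produced by training:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 2.0])     # learned variational means (illustrative)
rho = np.array([-2.0, -1.5, -3.0])  # unconstrained scale parameters (illustrative)
sigma = np.log1p(np.exp(rho))       # softplus keeps sigma positive

def sample_weights():
    # Reparameterization: w = mu + sigma * eps, eps ~ N(0, I).
    eps = rng.normal(size=mu.shape)
    return mu + sigma * eps          # one draw = one network from the ensemble

draws = np.stack([sample_weights() for _ in range(10000)])
# Empirical moments of the draws match the variational parameters.
assert np.allclose(draws.mean(axis=0), mu, atol=0.05)
assert np.allclose(draws.std(axis=0), sigma, atol=0.05)
```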
163. Bayesian Networks: Pros and Cons
As a result you have:
1. An infinite ensemble of neural networks!
2. No overfitting problem (in the classical sense)!
3. No adversarial examples problem!
4. A measure of prediction confidence!
5. ...
No free lunch:
1. A lot of work is still hidden in “scalability” and “convergence”!
2. Very (very!) expensive predictions!
165. Bayesian Networks Examples: SegNet
Alex Kendall et al. - Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding (2016)
170. Bayesian Networks in (near) Production: Uber
Lingxue Zhu et al. - Deep and Confident Prediction for Time Series at Uber (2017)
How it works:
1. LSTM network
2. Monte Carlo Dropout
3. Daily completed-trips prediction
4. Anomaly detection for various metrics
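The Monte Carlo Dropout step (point 2) can be sketched in a few lines: dropout stays on at prediction time, the model is run T times, and the spread of the outputs is the uncertainty estimate. The one-layer "network" below is illustrative, not Uber's LSTM:

```python
import numpy as np

# Illustrative one-layer linear "network"; in the paper this is an LSTM.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 1))
x = rng.normal(size=(1, 8))

def predict_with_dropout(x, p=0.5):
    # Inverted dropout; the mask is resampled on EVERY forward pass,
    # even at test time (this is what makes it "Monte Carlo").
    mask = rng.random(x.shape) > p
    return ((x * mask / (1 - p)) @ W).item()

T = 200
preds = np.array([predict_with_dropout(x) for _ in range(T)])
mean, std = preds.mean(), preds.std()  # point prediction and its confidence band
assert std > 0.0                        # stochastic passes disagree
```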
173. Bayesian Networks in (near) Production: Flo
Predicted distributions of cycle length for 40 independent users:
Switched to Empirical Bayes for now.
179. Speech Summary
1. Probabilistic modeling is a powerful tool with a strong mathematical background
2. Many techniques are currently not widely used in Deep Learning
3. You can improve many aspects of your model within the same framework
4. Scalability, stability of convergence, and inference cost are the main constraints
5. The future of Deep Learning looks Bayesian...
… (for the moment, for me)
180. Thank you for your (attention)!
I hope you have a lot of questions :)
Dzianis Dus
Lead Data Scientist at InData Labs