
# Machine Learning basics

Knowledge Sharing Presentation about the basics of Machine Learning.


### Machine Learning basics

1. 1. Machine Learning basics [reused from LDL for KSS] David Samu 14 Feb 2020 Knowledge Sharing Session
2. 2. Caveat These are slides reused from Learning Deep Learning series New presentation on Deep Learning for Land Cover Segmentation is on the way :)
3. 3. Table of Contents 1. Intro: ML Basic concepts 2. ML from a Statistical Learning perspective 3. ML from a Probabilistic perspective 4. ML algorithms in practice 5. Outlook: Challenges for classical ML algorithms
4. 4. Intro: Machine Learning basic concepts
5. 5. What is Machine Learning? ● A machine learning algorithm is an algorithm that is able to learn from data. ● Machine learning is essentially a form of applied statistics with increased emphasis on the use of computers to statistically estimate complicated functions and a decreased emphasis on proving confidence intervals around these functions. ● Deep learning is a specific kind of machine learning.
6. 6. A deluge of options… How to select the (a) right one? Scikit-Learn classifier comparison (source: scikit-learn.org)
7. 7. Anatomy of a Machine Learning algorithm ● T: task ● P: performance measure ● E: experience “A computer program is said to learn from experience E with respect to some task T and performance measure P, if its performance at task T, as measured by P, improves with experience E.” Typical main ML/DL components in practice: ● Dataset ● Model ● Objective function ● Optimization algorithm It’s useful to think of these as independent components of an ML system!
8. 8. Common ML methods we are already using ● Classification ○ which of k categories the input belongs to? ○ e.g. tree / no tree, which of n species? ● Regression ○ predict numerical value given some input ○ e.g. age / height / density of tree? ● Structured output ○ predict multiple interrelated variables ○ e.g. pixel-wise segmentation into object categories ● Anomaly detection ○ flag / predict unusual or atypical events ○ pipeline leak, natural disasters ● Denoising ○ predict the clean from corrupted data ○ SAR denoising Tang et al, 2018
9. 9. Other ML tasks we could apply in the future ● Transcription, translation ● Synthesis and sampling ● Imputing missing values ● Density estimation ● ... For what tasks can we use these techniques in the future? E.g. density estimation (GANs) for missing / corrupted value imputation: fill in cloud covered areas using other parts of the image, timestamps, sensors, ...
10. 10. Experience (aka the Dataset) ● Supervised learning ○ Regression, classification, … ● Unsupervised learning ○ Learn interesting properties, e.g. clustering ○ Learn the entire generative probability distribution ● Reinforcement learning ○ Models interacting with the environment ● (Energy-based, Adversarial, …) Although these terms are useful in practice, they are not formally defined and not always distinct! They can often be interchanged and combined: e.g. density estimation to support classification. Karczewski et al 2014
11. 11. Example: solving the linear regression problem ● Is this an iterative optimization process? ● Why didn’t we need to use gradient descent or some other method? ● Why would this not work for deep networks?
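A minimal sketch related to the questions on this slide, assuming the example refers to ordinary least squares: the squared-error objective is quadratic in the weights, so setting its gradient to zero gives a linear system (the normal equations) that can be solved in one step, without iterations. Deep networks are non-linear in their parameters, so no such closed form is available. The toy data and variable names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: y = 2x + 1 plus small Gaussian noise.
x = rng.uniform(-1.0, 1.0, size=50)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=50)

# Design matrix with a bias column, so w = [intercept, slope].
X = np.column_stack([np.ones_like(x), x])

# Normal equations: solve (X^T X) w = X^T y directly, no iterations needed.
w = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's general least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("closed-form weights [intercept, slope]:", w)
```

For larger or ill-conditioned problems, `np.linalg.lstsq` is usually the safer choice than forming `X.T @ X` explicitly.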
12. 12. Some thoughts before diving into ML concepts ● Central challenge of statistical learning, ML, DL: tame complexity and uncertainty ● Many tools and options, some guarantees, lots of uncertainty, no “best” answer ● Model design is a combination of science and art (+ engineering in practice) ● A spectrum where we aim towards the left side, but often find ourselves on the right: ○ deterministic ←→ probabilistic ○ formal ←→ heuristic ○ deductive ←→ inductive ○ analytic ←→ numeric Situation we want to avoid by understanding through theory
13. 13. Machine Learning from a statistical learning perspective
14. 14. Capacity, Overfitting, Underfitting Central challenge in ML: generalization to unseen input! Training error ←→ Test error Test set should be collected separately from training set! pdata: data-generating distribution Assumed to be i.i.d. for mathematical study Underfitting ← Capacity → Overfitting Model’s hypothesis space ~ Model capacity Ground truth: f(x) = sin(x), n=10 noisy data points, M: degree of polynomial fit Bishop: Pattern Recognition and Machine Learning, 2006
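The polynomial-fit experiment from this slide can be sketched as follows. This is an illustration under assumptions, not the code behind the original figure: the ground truth is taken to be sin(2πx) on [0, 1] (the setup used in Bishop's book), the noise level and the set of degrees M are arbitrary choices, and a large held-out sample stands in for the test set.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n, noise=0.2):
    """Sample n noisy points from the assumed ground truth sin(2*pi*x)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2.0 * np.pi * x) + noise * rng.normal(size=n)
    return x, y

x_train, y_train = make_data(10)     # n = 10 noisy training points
x_test, y_test = make_data(1000)     # separate sample to estimate test error

def rmse(model, x, y):
    """Root-mean-square error of a fitted polynomial on (x, y)."""
    return float(np.sqrt(np.mean((model(x) - y) ** 2)))

errors = {}
for degree in (0, 1, 3, 9):          # M: degree of the polynomial fit
    model = Polynomial.fit(x_train, y_train, deg=degree)
    errors[degree] = (rmse(model, x_train, y_train),
                      rmse(model, x_test, y_test))
    print(f"M={degree}: train RMSE={errors[degree][0]:.3f}, "
          f"test RMSE={errors[degree][1]:.3f}")
```

Training error can only go down as M grows, because each lower-degree model is a special case of the higher-degree one; the test error is the quantity to watch for overfitting.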
15. 15. Capacity, Overfitting, Underfitting (cont’d) Challenge: find model complexity that fits the task and the available data! The model’s representational capacity is usually constrained by insufficient / biased data and the optimization process → effective capacity Occam’s razor / principle of parsimony: from similarly performing models, choose the simpler one! No Free Lunch Theorem: no ML algorithm is universally any better than any other… Modeller has to find and design an appropriate model for the task & data at hand! (Cf. with figure on prev slide) Bishop: Pattern Recognition and Machine Learning, 2006
16. 16. Regularization “Preferences” for certain kinds of solutions L2 regularization, a.k.a. weight decay Trade-off between fitting the training data and keeping the weights small. In general, regularization aims to decrease the generalization error, possibly at the expense of training error. Central concern in ML, next to optimization (Chapters 7 and 8) (Cf. with figure on prev slide) Bishop: Pattern Recognition and Machine Learning, 2006
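A small sketch of the L2 trade-off, assuming a linear model on polynomial features and the penalized objective ‖Xw − y‖² + λ‖w‖², which has the closed-form minimizer w = (XᵀX + λI)⁻¹Xᵀy. The data, the degree and the λ values are hypothetical, and for simplicity the bias weight is penalized too.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy data, fitted with a degree-9 polynomial (10 weights).
x = rng.uniform(0.0, 1.0, size=10)
y = np.sin(2.0 * np.pi * x) + 0.2 * rng.normal(size=10)
X = np.vander(x, N=10, increasing=True)   # columns x^0 ... x^9

def ridge_fit(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 in closed form."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

results = {}
for lam in (1e-3, 1e-1, 10.0):
    w = ridge_fit(X, y, lam)
    weight_norm = float(np.linalg.norm(w))
    train_mse = float(np.mean((X @ w - y) ** 2))
    results[lam] = (weight_norm, train_mse)
    print(f"lambda={lam:g}: ||w||={weight_norm:.3f}, train MSE={train_mse:.4f}")
```

A larger λ shrinks the weights and raises the training error; whether it lowers the generalization error has to be checked on held-out data.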
17. 17. Hyperparameters, validation set Model parameters that are not directly optimized by the learning algorithm Difficult or not appropriate to optimize on the training set, e.g. model capacity or λ Validated against a validation set of examples, before using test set to measure generalization error. If used repeatedly, test sets can introduce bias into model selection! We need to update / extend / replace our test sets regularly! 1. Training set: find model weights w for fixed HP settings 2. Validation set: tune HPs 3. Test set: measure generalization error
18. 18. Bias and variance Bias: expected deviation of the estimate from the true value of the data-generating parameter. Unbiased estimator: bias = 0. Variance: variance of the estimate (as a result of resampling the data), decreases with the number of samples. Both are errors of an estimator (i.e. bad) that we can trade off to minimize MSE. Let’s derive this! Bias-variance and underfitting-overfitting are (loosely) related concepts through model capacity. ● Too high capacity: may need to regularize. ● More data should(!) always help (consistency) ● E.g. using only the 1st sample as an estimate of the mean: unbiased but inconsistent
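The derivation announced on this slide is a standard one. A sketch, assuming θ is the fixed true parameter, θ̂ is the estimator, and all expectations are taken over resampled datasets (the notation is an assumption, since the slide's formulas did not survive):

```latex
\begin{align*}
\mathrm{MSE}(\hat\theta)
  &= \mathbb{E}\big[(\hat\theta - \theta)^2\big] \\
  &= \mathbb{E}\big[(\hat\theta - \mathbb{E}[\hat\theta]
       + \mathbb{E}[\hat\theta] - \theta)^2\big] \\
  &= \mathbb{E}\big[(\hat\theta - \mathbb{E}[\hat\theta])^2\big]
     + 2\,(\mathbb{E}[\hat\theta] - \theta)\,
       \underbrace{\mathbb{E}\big[\hat\theta - \mathbb{E}[\hat\theta]\big]}_{=\,0}
     + (\mathbb{E}[\hat\theta] - \theta)^2 \\
  &= \mathrm{Var}(\hat\theta) + \mathrm{Bias}(\hat\theta)^2,
  \qquad \mathrm{Bias}(\hat\theta) = \mathbb{E}[\hat\theta] - \theta .
\end{align*}
```

The cross term vanishes because E[θ̂] − θ is a constant, so at a fixed MSE a reduction in variance has to be paid for with bias and vice versa.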
19. 19. Taking a step back: What is the purpose of all these abstract concepts again? Keep in mind: We introduce these concepts to gain some insight into how well particular models would generalize for given datasets! E.g. data-generating distribution / process, “true” distribution / parameters, etc. are all hypothetical concepts representing the “unknown” to help us estimate our expected generalization error for different models! And there are more abstract concepts to come! :0 [Figure: “data-generating process”, sampling, observed data] What we want: find a model that generalizes to unseen data → model and approximate the “data-generating process”
20. 20. Machine Learning from a probabilistic perspective
21. 21. Maximum Likelihood Estimation (MLE) θML: the parameter value that maximizes the probability of the observed data Finally, now we’re getting VERY abstract! :-) Demystifying KL Divergence https://towardsdatascience.com/
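To make the definition of θML concrete, here is a minimal sketch for a 1-D Gaussian, chosen only because its maximum likelihood estimates have a well-known closed form (sample mean and the 1/n standard deviation). The data-generating parameters 5.0 and 2.0 are hypothetical.

```python
import math
import random

random.seed(0)

# Hypothetical observed data drawn from a Gaussian with mean 5 and std 2.
data = [random.gauss(5.0, 2.0) for _ in range(1000)]

def neg_log_likelihood(mu, sigma, xs):
    """Negative log-likelihood of i.i.d. samples xs under N(mu, sigma^2)."""
    n = len(xs)
    squared_error = sum((x - mu) ** 2 for x in xs)
    return (n * math.log(sigma)
            + 0.5 * n * math.log(2.0 * math.pi)
            + squared_error / (2.0 * sigma ** 2))

# Closed-form maximum likelihood estimates for a Gaussian.
mu_ml = sum(data) / len(data)
sigma_ml = math.sqrt(sum((x - mu_ml) ** 2 for x in data) / len(data))

best_nll = neg_log_likelihood(mu_ml, sigma_ml, data)
print(f"mu_ML={mu_ml:.3f}, sigma_ML={sigma_ml:.3f}, NLL={best_nll:.1f}")
```

Any other parameter pair gives a higher negative log-likelihood on the same data, which is what "maximizes the probability of the observed data" means in practice.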
22. 22. KL divergence, cross-entropy and their other friends Minimizing the KL divergence between the empirical distribution pdata and model distribution pmodel (for some set of model parameters θ) is equivalent to: ● maximizing the likelihood / minimizing the negative log likelihood (NLL) of pmodel ● maximizing the expected log-probability of the observed data under pmodel ● minimizing the cross-entropy between pdata and pmodel ● maximizing the mutual information between pdata and pmodel Different perspectives of the same concept! Remember those intriguing log functions in the definition of entropy from Chapter 3? This is why they are important in ML (theory)! Cross-entropy (between pdata and pmodel): any loss consisting of a negative log-likelihood Most frequently used cost function in DL! ● Why does minimizing this maximize the probability of the data under the model? ● Why is it useful / why do we want to do that?
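The equivalence between minimizing KL divergence and minimizing cross-entropy follows from KL(p ‖ q) = H(p, q) − H(p): the entropy H(p) of the data distribution does not depend on the model parameters. A small numerical check, assuming a hypothetical binary data distribution and a Bernoulli model family searched over a grid of θ values:

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum_i p_i log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical empirical distribution over outcomes {0, 1}.
p_data = [0.3, 0.7]

# Model family: Bernoulli with P(1) = theta, evaluated on a grid.
thetas = [i / 100 for i in range(1, 100)]

def model(theta):
    return [1.0 - theta, theta]

best_by_kl = min(thetas, key=lambda t: kl_divergence(p_data, model(t)))
best_by_ce = min(thetas, key=lambda t: cross_entropy(p_data, model(t)))

print("theta minimizing KL:", best_by_kl)
print("theta minimizing cross-entropy:", best_by_ce)
```

Both criteria pick the same θ, and the gap between them is H(p_data) for every candidate model.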
23. 23. Conditional Log-Likelihood So far we discussed models that generate the entire observed dataset, which is useful for unsupervised learning. What about supervised learning, where we only want to predict the labels? Conditional LLH! If i.i.d., it simplifies to: But now the formula became longer. Why is this a “simplification”? Minimizing MSE during curve fitting is equivalent to maximizing the log-likelihood of y given x under model w, assuming Gaussian distributed “noise”. Bishop: Pattern Recognition and Machine Learning, 2006
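The formulas on this slide were images and are not in the transcript. A reconstruction in commonly used notation, which is an assumption here: X are the inputs, Y the labels, m the number of examples, f(x; w) the model prediction and σ² a fixed noise variance.

```latex
\begin{align*}
\theta_{\mathrm{ML}}
  &= \arg\max_{\theta} \; P(Y \mid X; \theta) \\
  &= \arg\max_{\theta} \; \sum_{i=1}^{m} \log P\big(y^{(i)} \mid x^{(i)}; \theta\big)
  && \text{(i.i.d. examples)} \\[6pt]
\sum_{i=1}^{m} \log \mathcal{N}\big(y^{(i)};\, f(x^{(i)}; w),\, \sigma^2\big)
  &= -\frac{1}{2\sigma^2} \sum_{i=1}^{m} \big(y^{(i)} - f(x^{(i)}; w)\big)^2
     - \frac{m}{2} \log\big(2\pi\sigma^2\big)
\end{align*}
```

The i.i.d. form is longer on paper but simpler to work with, because the joint probability factorizes into a sum of per-example terms. In the Gaussian case only the first term depends on w, so maximizing the conditional log-likelihood is the same as minimizing the sum of squared errors, and hence the MSE.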
24. 24. “Entropy - shmentropy… Do I really need to know all this complicated business to train real ML models?” https://www.inference.vc/ In practice, not really. But it definitely helps to understand what others (and you) are doing, even just in practice ;-) Why is Maximum Likelihood estimation useful in ML? Under appropriate conditions it is consistent, and it has the highest rate of convergence among all consistent estimators (efficiency)! [Cramér-Rao lower bound]
25. 25. Frequentist vs Bayesian statistics Frequentist perspective (so far): ● θ is fixed but unknown ● θhat is probabilistic due to stochastic nature of sampling observed data Bayesian perspective: ● Observed data is fixed, not random ● Probability: uncertainty in our belief about the true value of θ: p(θ) ● Observations decrease prior uncertainty / increase knowledge via belief update
26. 26. Belief update in Bayesian statistics Belief update using Bayes rule: ● Why does this equation hold? ● Which term is the prior, evidence, posterior, normalization constant? ● How / why does this belief update process work? www.tu-chemnitz.de/
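The belief update can be sketched numerically with Bayes' rule, posterior ∝ likelihood × prior, normalized by the evidence. The example below is hypothetical and not from the slides: θ is the unknown probability of heads of a coin, the prior is uniform over a grid of candidate values, and each observation (1 = heads, 0 = tails) updates the belief in turn.

```python
# Grid of candidate values for theta = P(heads).
thetas = [i / 100 for i in range(101)]

# Prior belief p(theta): uniform over the grid (an assumption of this sketch).
prior = [1.0 / len(thetas)] * len(thetas)

def update(belief, observation):
    """One Bayes update: posterior = likelihood * prior / evidence."""
    likelihood = [t if observation == 1 else 1.0 - t for t in thetas]
    unnormalized = [l * b for l, b in zip(likelihood, belief)]
    evidence = sum(unnormalized)          # normalization constant p(x)
    return [u / evidence for u in unnormalized]

observations = [1, 1, 0, 1, 1, 1, 0, 1]   # hypothetical data: 6 heads, 2 tails

posterior = prior
for obs in observations:
    # The posterior after one observation is the prior for the next one.
    posterior = update(posterior, obs)

posterior_mean = sum(t * p for t, p in zip(thetas, posterior))
posterior_mode = thetas[max(range(len(thetas)), key=posterior.__getitem__)]
print(f"posterior mean={posterior_mean:.3f}, mode={posterior_mode:.2f}")
```

With more observations the posterior concentrates around fewer values of θ, which is the "observations decrease prior uncertainty" statement in numerical form.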
27. 27. ● Frequentist’s critique: Prior is a subjective human choice! ● The Bayesian’s rebuke: State your assumptions explicitly! Bayes learning Predicted distribution after observing m samples: In general, Bayesian estimation is more conservative than frequentist Max LH point estimate: ● It has lower variance / risk of overfitting, due to integration over ∀θ, ● but it is potentially biased, underfit model, e.g. if prior is not well chosen
28. 28. Max a posteriori estimate and Regularized Max LL MAP: a less precise, but tractable point estimate alternative to full posterior PD. LLH + bias Additive regularization terms in Max LL often correspond to priors in MAP. E.g. weight decay penalizes larger weights → biased toward smaller weights → a Bayesian prior centered around 0. High / low weight of regularization term ←→ narrow / broad shape of prior distribution https://towardsdatascience.com
29. 29. Recap: linear regression and logistic regression If we know p(y | x) → we solved the problem! In practice: limited data → need to generalize out of sample → we use: ● a parametric family of distributions p(y | x; θ) ● MLE or MAP to find the best parameter vector θ E.g. linear regression: ● closed-form solution exists Logistic regression: ● no closed-form solution, need to minimize the negative LLH by e.g. gradient descent Wikipedia
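Since logistic regression has no closed-form solution, its negative log-likelihood is minimized iteratively. A minimal sketch using plain (full-batch) gradient descent; the synthetic data, the generating weights `w_true`, the learning rate and the number of steps are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary classification data generated from a logistic model.
n = 200
X = rng.normal(size=(n, 2))
w_true = np.array([2.0, -1.0])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(X @ w_true)))).astype(float)

def nll(w):
    """Mean negative log-likelihood, in the stable form log(1 + e^z) - y*z."""
    z = X @ w
    return float(np.mean(np.logaddexp(0.0, z) - y * z))

def nll_grad(w):
    """Gradient of the mean NLL: X^T (sigmoid(Xw) - y) / n."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / n

w = np.zeros(2)
learning_rate = 0.5
losses = [nll(w)]
for _ in range(500):
    w = w - learning_rate * nll_grad(w)
    losses.append(nll(w))

print("estimated weights:", w)
print(f"NLL: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The objective is convex, so gradient descent with a small enough step size approaches the maximum likelihood solution; at w = 0 the loss starts at log 2 because every prediction is 0.5.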
30. 30. Machine learning algorithms (estimators) in practice
31. 31. Some supervised methods Task: associate input x with output y Example so far: linear regression
32. 32. Support Vector Machine Kernel trick: dot product between examples rewritten to non-linear kernel function Learning a linear model in a transformed space ● Space transformation is kept fixed ● Transformed linear model is easy to learn Drawback: comp. cost of learning does not scale well with training examples!
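The kernel trick can be checked on a tiny example: for 2-D inputs, the polynomial kernel k(x, z) = (x · z)² equals an ordinary dot product after mapping both inputs through φ(x) = (x₁², √2·x₁x₂, x₂²). The kernel choice and the two input vectors are assumptions made for illustration; an SVM never needs to compute φ explicitly.

```python
import numpy as np

def poly_kernel(x, z):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return float(np.dot(x, z) ** 2)

def feature_map(x):
    """Explicit feature map whose dot product reproduces poly_kernel."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_implicit = poly_kernel(x, z)                                 # no transformed space needed
k_explicit = float(np.dot(feature_map(x), feature_map(z)))     # same value via phi

print(k_implicit, k_explicit)
```

For higher-degree or RBF kernels the explicit feature space becomes very large or infinite-dimensional, while the kernel evaluation stays cheap.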
33. 33. k-nearest neighbour (KNN) Quintessential non-parametric, non-probabilistic “learning” method: ● take average / most frequent of k nearest neighbours Pros: High capacity, no training, typically good accuracy with large training data Cons: large “model” size (full dataset) and computational cost, provides zero “insight” into the problem 1-NN classification map Wikipedia
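A minimal k-NN classifier, to show that the "model" is just the stored training set plus a distance computation at query time. Euclidean distance, majority voting and the toy data are assumptions of this sketch.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Return the most frequent label among the k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                     # indices of the k closest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical toy training set with two well-separated classes.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.2, 0.3])))
print(knn_predict(X_train, y_train, np.array([5.5, 5.2])))
```

Every prediction scans the full training set, which is the computational cost mentioned under "Cons".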
34. 34. Decision trees Basic algorithm: axis-aligned splits with constant outputs → non-parametric and non-probabilistic Many variants (making DTs more parametric) ● Regularized DTs (e.g. pruning) ● Ensemble methods ○ Boosted trees ○ Bagged trees / random forests Pros: simple DTs are easy to interpret (RFs not!) Cons: unstable (small change in input can cause large change in model / output)
35. 35. Some unsupervised methods Task: “extract information” from a distribution such as: ● Density estimation ● Sampling ● Denoising ● Manifold learning ● (dimensionality reduction) ● Clustering Classic task: find “best representation” of data ● Simplify while retaining as much information as possible Non-linear dimensionality reduction Wikipedia
36. 36. “Best” in what sense? Three most common criteria for best representation: ● Low-dimensional representation ● Sparse representation ● Independent representation These properties are useful principles because they help us: ● manipulate data more effectively ● understand the data-generator process (ultimately Nature) We want to disentangle unknown factors of variation, remove redundancy Karczewski et al 2014
37. 37. Principal Component Analysis (PCA) PCA learns: 1. lower dimensional representations 2. with linearly uncorrelated dimensions while preserving as much information (variance) of the original data as possible. Disentangle factors of variation by finding a rotation that aligns the principal axes of variation with the new basis vectors. Wikipedia
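PCA can be sketched with a singular value decomposition of the centered data: the right singular vectors are the principal axes, and projecting onto them yields uncorrelated coordinates ordered by variance. The synthetic 2-D data below (an elongated Gaussian cloud rotated by 30 degrees) is a hypothetical example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: std 3.0 along one latent axis, 0.5 along the other,
# then rotated by 30 degrees and shifted.
n = 500
latent = rng.normal(size=(n, 2)) * np.array([3.0, 0.5])
angle = np.deg2rad(30.0)
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
X = latent @ R.T + np.array([10.0, -4.0])

# PCA: center the data, then take the SVD.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt                          # rows are the principal axes
explained_variance = S ** 2 / (n - 1)    # variance along each principal axis

Z = X_centered @ components.T            # rotated, decorrelated representation
Z_1d = Z[:, :1]                          # lower-dimensional representation

print("first principal axis:", components[0])
print("explained variance:", explained_variance)
```

Keeping only the leading columns of `Z` gives the lower-dimensional representation while discarding the directions with the least variance.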
38. 38. K-means clustering Divide the training set into k different clusters of examples that are “close” to each other. Number of clusters? Distance metric? Extreme case of sparse representation (one-hot vector coding), but loses the advantage of distributed representations. Algorithm: iterative refinement of clusters 1. Assign cluster labels (nearest centroid) 2. Update centroids https://scikit-learn.org/
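The two alternating steps can be sketched as follows. This is an illustrative implementation, not scikit-learn's: Euclidean distance is assumed, and the farthest-point initialization is a simple choice made here so that the toy example is reproducible. The two-blob dataset is hypothetical.

```python
import numpy as np

def init_centroids(X, k, rng):
    """Pick one random point, then repeatedly the point farthest from all chosen ones."""
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        chosen = np.array(centroids)
        dists = np.linalg.norm(X[:, None, :] - chosen[None, :, :], axis=2)
        centroids.append(X[np.argmax(dists.min(axis=1))])
    return np.array(centroids)

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = init_centroids(X, k, rng)
    for _ in range(n_iters):
        # Step 1: assign each example to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Step 2: move each centroid to the mean of its assigned examples.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical data: two well-separated Gaussian blobs of 50 points each.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

centroids, labels = kmeans(X, k=2)
print(centroids)
```

The result depends on the initialization in general; with purely random starting centroids the procedure can settle in a poor local optimum, which is why multiple restarts are common.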
39. 39. https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html No free lunch!
40. 40. Outlook: Challenges for classical ML algorithms
41. 41. Anatomy of a Machine Learning algorithm (revisited) Typical main ML/DL components in practice: ● Dataset ● Model ● Objective function ● Optimization algorithm It’s useful to think of these as independent components of an ML system!
42. 42. The Curse of Dimensionality Problem: As the number of dimensions of the data increases, the number of configurations of interest may grow exponentially. Traditional ML methods either ● learned too simple mappings between x and y (e.g. linear regression), or ● learned local regions of the training data (k-NN, random forest, SVM) → do not generalize well to unseen regions of input space! Assumption of smoothness / local constancy does not hold for complex problems!
43. 43. Manifold learning Insight (hypothesis): in practice, the distribution of natural images / sounds / language sequences / ... occupies only a very small volume of the total space. Concentration of probability distributions! These manifolds (sub-spaces) constitute useful dimensions of variation (e.g. lighting, rotation, size, etc. of the same object). Some fascinating tools that utilize the manifold learning capabilities of deep neural networks: https://distill.pub/2017/aia/
44. 44. Stochastic Gradient Descent Problem: we need many examples to learn complex models that are useful in the real world. However, iterative Gradient Descent over the full training set then becomes too slow! Solution: Use a minibatch of examples (sampled uniformly from the training data) Cornerstone of modern Deep Learning! More on optimization in Chapter 8. https://medium.com/ai-society/hello-gradient-descent-ef74434bdfa5
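A minimal sketch of minibatch SGD on a linear regression problem, chosen because the true weights are known and convergence is easy to check. The dataset, batch size, learning rate and step count are hypothetical. Each step estimates the gradient from a small uniformly sampled batch, so its cost does not grow with the size of the training set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical large-ish training set: y = X @ w_true + noise.
n = 10_000
X = rng.normal(size=(n, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(3)
learning_rate = 0.05
batch_size = 32

for step in range(2000):
    # Sample a minibatch uniformly from the training data.
    idx = rng.integers(0, n, size=batch_size)
    X_batch, y_batch = X[idx], y[idx]
    # Gradient of the mean squared error on the minibatch only.
    grad = 2.0 * X_batch.T @ (X_batch @ w - y_batch) / batch_size
    w = w - learning_rate * grad

print("estimated weights:", w)
```

The minibatch gradient is a noisy but unbiased estimate of the full gradient, so the iterates hover around the optimum; decaying the learning rate over time is a common way to reduce that noise.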
45. 45. What makes DL different from classical ML methods? ● Scales better with more data / model capacity (# of parameters) ● Lends itself better to end-to-end supervised learning ● A nice presentation on this by Andrew Ng, plus practical advice on ML development: ○ https://www.youtube.com/watch?v=F1ka6a13S9I
46. 46. The End