Introduction to Statistical Machine Learning     -1-    Marcus Hutter




                      Introduction to
     Statistical Machine Learning

                                Marcus Hutter
                       Canberra, ACT, 0200, Australia
                  Machine Learning Summer School
                  MLSS-2008, 2 – 15 March, Kioloa


                          ANU         RSISE    NICTA
Introduction to Statistical Machine Learning   -2-         Marcus Hutter

                                  Abstract
This course provides a broad introduction to the methods and practice
of statistical machine learning, which is concerned with the development
of algorithms and techniques that learn from observed data by
constructing stochastic models that can be used for making predictions
and decisions. Topics covered include Bayesian inference and maximum
likelihood modeling; regression, classification, density estimation,
clustering, principal component analysis; parametric, semi-parametric,
and non-parametric models; basis functions, neural networks, kernel
methods, and graphical models; deterministic and stochastic
optimization; overfitting, regularization, and validation.
Introduction to Statistical Machine Learning   -3-    Marcus Hutter

                         Table of Contents
               1. Introduction / Overview / Preliminaries
               2. Linear Methods for Regression
               3. Nonlinear Methods for Regression
               4. Model Assessment & Selection
               5. Large Problems
               6. Unsupervised Learning
               7. Sequential & (Re)Active Settings
               8. Summary
Intro/Overview/Preliminaries        -4-        Marcus Hutter



  1     INTRO/OVERVIEW/PRELIMINARIES

      • What is Machine Learning? Why Learn?
      • Related Fields
      • Applications of Machine Learning
      • Supervised↔Unsupervised↔Reinforcement Learning
      • Dichotomies in Machine Learning
      • Mini-Introduction to Probabilities
Intro/Overview/Preliminaries              -5-              Marcus Hutter

                What is Machine Learning?
     Machine Learning is concerned with the development of
     algorithms and techniques that allow computers to learn

Learning in this context is the process of gaining understanding by
constructing models of observed data with the intention to use them for
prediction.
                               Related fields
  • Artificial Intelligence: smart algorithms
  • Statistics: inference from a sample
  • Data Mining: searching through large volumes of data
  • Computer Science: efficient algorithms and complex models
Intro/Overview/Preliminaries              -6-                    Marcus Hutter

                               Why ’Learn’ ?
              There is no need to “learn” to calculate payroll

                           Learning is used when:
  • Human expertise does not exist (navigating on Mars),
  • Humans are unable to explain their expertise (speech recognition)
  • Solution changes in time (routing on a computer network)
  • Solution needs to be adapted to particular cases (user biometrics)


Example: It is easier to write a program that learns to play checkers or
backgammon well by self-play rather than converting the expertise of a
master player to a program.
Intro/Overview/Preliminaries            -7-               Marcus Hutter

         Handwritten Character Recognition
            an example of a difficult machine learning problem




Task: Learn general mapping from pixel images to digits from examples
Intro/Overview/Preliminaries              -8-              Marcus Hutter

          Applications of Machine Learning
    machine learning has a wide spectrum of applications including:
  • natural language processing,
  • search engines,
  • medical diagnosis,
  • detecting credit card fraud,
  • stock market analysis,
  • bio-informatics, e.g. classifying DNA sequences,
  • speech and handwriting recognition,
  • object recognition in computer vision,
  • playing games – learning by self-play: Checkers, Backgammon.
  • robot locomotion.
Intro/Overview/Preliminaries     -9-               Marcus Hutter

       Some Fundamental Types of Learning

  • Supervised Learning: Classification, Regression
  • Reinforcement Learning: Agents
  • Unsupervised Learning: Association, Clustering, Density Estimation
  • Others: Semi-Supervised Learning, Active Learning
Intro/Overview/Preliminaries            - 10 -                 Marcus Hutter

                      Supervised Learning

  • Prediction of future cases:
    Use the rule to predict the output for future inputs

  • Knowledge extraction:
    The rule is easy to understand

  • Compression:
    The rule is simpler than the data it explains

  • Outlier detection:
    Exceptions that are not covered by the rule, e.g., fraud
Intro/Overview/Preliminaries            - 11 -   Marcus Hutter

                               Classification

Example:
Credit scoring


Differentiating between
low-risk and high-risk
customers from their
Income and Savings



Discriminant: IF income > θ1 AND savings > θ2
               THEN low-risk ELSE high-risk
Intro/Overview/Preliminaries           - 12 -             Marcus Hutter

                               Regression
   Example: Price y = f(x) + noise of a used car as a function of age x
Intro/Overview/Preliminaries           - 13 -   Marcus Hutter

                    Unsupervised Learning

  • Learning “what normally happens”

  • No output

  • Clustering: Grouping similar instances

  • Example applications:
    Customer segmentation in CRM
    Image compression: Color quantization
    Bioinformatics: Learning motifs
Intro/Overview/Preliminaries              - 14 -   Marcus Hutter

                   Reinforcement Learning
  • Learning a policy: A sequence of outputs

  • No supervised output but delayed reward

  • Credit assignment problem

  • Game playing

  • Robot in a maze

  • Multiple agents, partial observability, ...
Intro/Overview/Preliminaries              - 15 -            Marcus Hutter

          Dichotomies in Machine Learning

          scope of my lecture         ⇔     scope of other lectures
   (machine) learning / statistical   ⇔     logic/knowledge-based (GOFAI)
      induction ⇔ prediction          ⇔     decision ⇔ action
                         regression   ⇔     classification
independent identically distributed   ⇔     sequential / non-iid
                   online learning    ⇔     offline/batch learning
                passive prediction    ⇔     active learning
                        parametric    ⇔     non-parametric
        conceptual/mathematical       ⇔     computational issues
                  exact/principled    ⇔     heuristic
               supervised learning    ⇔     unsupervised ⇔ RL learning
Intro/Overview/Preliminaries               - 16 -        Marcus Hutter

                        Probability Basics
          Probability is used to describe uncertain events;
       the chance or belief that something is or will be true.
Example: Fair Six-Sided Die:
  • Sample space: Ω = {1, 2, 3, 4, 5, 6}
  • Events: Even= {2, 4, 6}, Odd= {1, 3, 5}         ⊆Ω
  • Probability: P(6) = 1/6, P(Even) = P(Odd) = 1/2
  • Outcome: e.g. 6 ∈ Even.
  • Conditional probability: P(6|Even) = P(6 and Even)/P(Even) = (1/6)/(1/2) = 1/3
General Axioms:
 • P({}) = 0 ≤ P(A) ≤ 1 = P(Ω),
 • P(A ∪ B) + P(A ∩ B) = P(A) + P(B),
 • P(A ∩ B) = P(A|B)P(B).
Intro/Overview/Preliminaries                  - 17 -                   Marcus Hutter

                        Probability Jargon
Example: (Un)fair coin: Ω = {Tail, Head} ≅ {0, 1}. P(1) = θ ∈ [0, 1]:
Likelihood: P(1101|θ) = θ × θ × (1 − θ) × θ = θ³(1 − θ)
Maximum Likelihood (ML) estimate: θ̂ = arg max_θ P(1101|θ) = 3/4
Prior: If we are indifferent, then P(θ) = const. = 1
Evidence: P(1101) = Σ_θ P(1101|θ)P(θ) = 1/20 (actually ∫ dθ)
Posterior: P(θ|1101) = P(1101|θ)P(θ) / P(1101) ∝ θ³(1 − θ) (BAYES RULE!).
Maximum a Posteriori (MAP) estimate: θ̂ = arg max_θ P(θ|1101) = 3/4
Predictive distribution: P(1|1101) = P(11011) / P(1101) = 2/3
Expectation: E[f|...] = Σ_θ f(θ)P(θ|...), e.g. E[θ|1101] = 2/3
Variance: Var(θ) = E[(θ − Eθ)²|1101] = 2/63
Probability density: P(θ) = (1/ε) P([θ, θ + ε]) for ε → 0
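The numbers above can be checked numerically. The following sketch (not part of the original slides) integrates the likelihood θ³(1 − θ) against the uniform prior with scipy; the variable names are illustrative.

```python
# Numeric check of the coin example above (a sketch, not from the slides).
from scipy import integrate

def likelihood(theta):            # P(1101|theta) = theta^3 (1 - theta)
    return theta**3 * (1 - theta)

prior = lambda theta: 1.0         # uniform prior P(theta) = 1 on [0, 1]

evidence, _ = integrate.quad(lambda t: likelihood(t) * prior(t), 0, 1)
posterior = lambda t: likelihood(t) * prior(t) / evidence

post_mean, _ = integrate.quad(lambda t: t * posterior(t), 0, 1)
post_var, _ = integrate.quad(lambda t: (t - post_mean)**2 * posterior(t), 0, 1)

print(evidence)    # 0.05     = 1/20  (Evidence)
print(post_mean)   # 0.6667   = 2/3   (E[theta|1101], also the predictive P(1|1101))
print(post_var)    # 0.031746 = 2/63  (Variance)
```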
Linear Methods for Regression         - 18 -           Marcus Hutter



2      LINEAR METHODS FOR REGRESSION

               • Linear Regression
               • Coefficient Subset Selection
               • Coefficient Shrinkage
               • Linear Methods for Classification
               • Linear Basis Function Regression (LBFR)
               • Piecewise linear, Splines, Wavelets
               • Local Smoothing & Kernel Regression
               • Regularization & 1D Smoothing Splines
Linear Methods for Regression               - 19 -              Marcus Hutter

                         Linear Regression
                     fitting a linear function to the data
  • Input “feature” vector x := (1 ≡ x^(0), x^(1), ..., x^(d)) ∈ ℝ^(d+1)
  • Real-valued noisy response y ∈ ℝ.
  • Linear regression model:
    ŷ = f_w(x) = w_0 x^(0) + ... + w_d x^(d)
  • Data: D = (x_1, y_1), ..., (x_n, y_n)
  • Error or loss function:
    Example: Residual sum of squares:
    Loss(w) = Σ_{i=1}^n (y_i − f_w(x_i))²
  • Least squares (LSQ) regression:
    ŵ = arg min_w Loss(w)
  • Example: Person’s weight y as a function of age x_1, height x_2.
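A minimal least-squares sketch in NumPy (not from the slides; the synthetic data, the weight vector w_true, and the noise level are made up for illustration):

```python
# Least squares regression: w_hat = arg min_w sum_i (y_i - f_w(x_i))^2
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])   # x^(0) = 1 prepended
w_true = np.array([1.0, 2.0, -0.5])                          # illustrative weights
y = X @ w_true + 0.1 * rng.normal(size=n)                    # noisy response

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)                # solve the LSQ problem
print(w_hat)                                                 # close to w_true
```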
Linear Methods for Regression             - 20 -            Marcus Hutter

               Coefficient Subset Selection

Problems with least squares regression if d is large:

  • Overfitting: The plane fits the data well (perfect for d ≥ n),
    but predicts (generalizes) badly.

  • Interpretation: We want to identify a small subset of
    features important/relevant for predicting y.

Solution 1: Subset selection:
Take those k out of d features that minimize the LSQ error.
Linear Methods for Regression            - 21 -             Marcus Hutter

                     Coefficient Shrinkage
Solution 2: Shrinkage methods:
Shrink the least squares w
by penalizing the Loss:

Ridge regression: Add ∝ ||w||₂².

Lasso: Add ∝ ||w||₁.

Bayesian linear regression:
Compute the MAP estimate arg max_w P(w|D)
from prior P(w) and
sampling model P(D|w).


                         Weights of low variance components shrink most.
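For a concrete instance of shrinkage, ridge regression has the closed form ŵ = (X⊤X + λI)⁻¹X⊤y; a sketch follows (the function and parameter names are made up, and lam = 0 recovers ordinary least squares):

```python
# Ridge regression sketch: penalize Loss(w) by lam * ||w||_2^2.
import numpy as np

def ridge(X, y, lam):
    d = X.shape[1]
    # Closed-form minimizer of ||y - Xw||^2 + lam * ||w||^2
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Larger lam shrinks the weights toward 0; lam = 0 gives ordinary least squares.
```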
Linear Methods for Regression              - 22 -                 Marcus Hutter

           Linear Methods for Classification
Example: Y = {spam, non-spam} ≅ {−1, 1} (or {0, 1})
Reduction to regression: Regard y ∈ ℝ ⇒ ŵ from linear regression.
Binary classification: If f_ŵ(x) > 0 then ŷ = 1 else ŷ = −1.
Probabilistic classification: Predict probability that new x is in class y.
    Log-odds log[P(y=1|x,D) / P(y=0|x,D)] := f_ŵ(x)

Improvements:
  • Linear Discriminant Analysis (LDA)
  • Logistic Regression
  • Perceptron
  • Maximum Margin Hyperplane
  • Support Vector Machine
Generalization to non-binary Y possible.
Linear Methods for Regression            - 23 -                Marcus Hutter

    Linear Basis Function Regression (LBFR)
     = powerful generalization of and reduction to linear regression!
Problem: Response y ∈ ℝ is often not linear in x ∈ ℝ^d.
Solution: Transform x ↦ φ(x) with φ : ℝ^d → ℝ^p.
         Assume/hope response y ∈ ℝ is now linear in φ.
Examples:
  • Linear regression: p = d and φ_i(x) = x_i.
  • Polynomial regression: d = 1 and φ_i(x) = x^i.
  • Piecewise constant regression:
    E.g. d = 1 with φ_i(x) = 1 for i ≤ x < i + 1 and 0 else.
  • Piecewise polynomials ...
  • Splines ...
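A sketch of polynomial regression as linear regression in the basis φ_i(x) = x^i (the toy data and the degree p = 5 are illustrative choices):

```python
# Linear basis function regression with a polynomial basis (d = 1).
import numpy as np

def poly_features(x, p):
    return np.vander(x, N=p + 1, increasing=True)     # columns 1, x, x^2, ..., x^p

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)   # nonlinear toy target

Phi = poly_features(x, p=5)                  # transform x -> phi(x)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # ordinary linear LSQ in phi-space
```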
Linear Methods for Regression   - 24 -   Marcus Hutter
Linear Methods for Regression   - 25 -   Marcus Hutter

2D Spline LBFR and 1D Symmlet-8 Wavelets
Linear Methods for Regression             - 26 -              Marcus Hutter

       Local Smoothing & Kernel Regression
Estimate f (x) by averaging the yi for all xi a-close to x:
               Pn
     ˆ
     f (x) =   Pn K(x,xi )yi
                i=1
                        K(x,xi )
                  i=1
                                   Generalization to other K,
Nearest-Neighbor Kernel:           e.g. quadratic (Epanechnikov) Kernel:
K(x, xi ) = 1 if { |x−xi |<a
                    and 0 else     K(x, xi ) = max{0, a2 − (x − xi )2 }
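A sketch of the kernel-regression estimate above with the Epanechnikov kernel (function names and the default width a are illustrative; it assumes every query point has at least one training point within distance a):

```python
# Nadaraya-Watson style kernel regression with the Epanechnikov kernel.
import numpy as np

def epanechnikov(x, xi, a):
    return np.maximum(0.0, a**2 - (x - xi)**2)

def kernel_regression(x_query, x_train, y_train, a=0.5):
    # K[q, i] = K(x_query[q], x_train[i]); rows must not be all zero.
    K = epanechnikov(x_query[:, None], x_train[None, :], a)
    return (K @ y_train) / K.sum(axis=1)      # weighted average of the y_i
```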
Linear Methods for Regression           - 27 -            Marcus Hutter

     Regularization & 1D Smoothing Splines
               to avoid overfitting if function class is large
           f̂ = arg min_f { Σ_{i=1}^n (y_i − f(x_i))² + λ ∫ (f''(x))² dx }

  • λ = 0       ⇒ f̂ = any function through the data
  • λ = ∞       ⇒ f̂ = least squares line fit
  • 0 < λ < ∞   ⇒ f̂ = piecewise cubic with continuous derivative
Nonlinear Regression              - 28 -         Marcus Hutter



            3     NONLINEAR REGRESSION

                • Artificial Neural Networks
                • Kernel Trick
                • Maximum Margin Classifier
                • Sparse Kernel Methods / SVMs
Nonlinear Regression                          - 29 -         Marcus Hutter

                Artificial Neural Networks 1
                       as non-linear function approximator
          Single hidden layer feed-forward neural network
  • Hidden layer: z_j = h(Σ_{i=0}^d w^(1)_ji x_i)
  • Output: f_w,k(x) = σ(Σ_{j=0}^M w^(2)_kj z_j)
  • Sigmoidal activation functions:
    h() and σ() ⇒ f non-linear
  • Goal: Find the network weights
    best modeling the data:
  • Back-propagation algorithm:
    Minimize Σ_{i=1}^n ||y_i − f_w(x_i)||₂²
    w.r.t. w by gradient descent.
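A sketch of a single-hidden-layer network trained by batch gradient descent on the squared error; the sizes, learning rate, tanh/identity activations, and omission of bias terms are illustrative choices, not the slide's exact setup:

```python
# One-hidden-layer feed-forward network with back-propagation (a sketch).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # n = 200 inputs, d = 3 features
y = np.sin(X[:, :1])                           # toy regression target, 1 output

M = 10                                         # number of hidden units
W1 = rng.normal(scale=0.5, size=(3, M))        # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(M, 1))        # hidden -> output weights

lr = 0.05
for step in range(2000):
    Z = np.tanh(X @ W1)                        # hidden layer z_j = h(sum_i w_ji x_i)
    F = Z @ W2                                 # linear output unit
    err = F - y                                # gradient of squared error w.r.t. F
    gW2 = Z.T @ err / len(X)
    gW1 = X.T @ ((err @ W2.T) * (1 - Z**2)) / len(X)   # back-propagated gradient
    W2 -= lr * gW2                             # gradient descent step
    W1 -= lr * gW1
```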
Nonlinear Regression                - 30 -            Marcus Hutter

                Artificial Neural Networks 2
  • Avoid overfitting by early stopping or small M .
  • Avoid local minima by random initial weights
    or stochastic gradient descent.

                       Example: Image Processing
Nonlinear Regression                   - 31 -                         Marcus Hutter

                          Kernel Trick

          The Kernel-Trick allows one to reduce the functional
          minimization to a finite-dimensional optimization.


  • Let L(y, f (x)) be any loss function
  • and J(f ) be a penalty quadratic in f .
                                             n
  • then minimum of penalized loss Σ_{i=1}^n L(y_i, f(x_i)) + λJ(f)
  • has form f(x) = Σ_{i=1}^n α_i K(x, x_i)
  • with α minimizing Σ_{i=1}^n L(y_i, (Kα)_i) + λ α⊤Kα,
  • and Kernel K_ij = K(x_i, x_j) following from J.
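With the squared loss L(y, f(x)) = (y − f(x))², the α above has the closed form α = (K + λI)⁻¹ y (kernel ridge regression). A sketch, using a Gaussian kernel as one possible choice; the helper names are made up:

```python
# Kernel ridge regression: f(x) = sum_i alpha_i K(x, x_i), alpha = (K + lam I)^{-1} y.
import numpy as np

def gauss_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :])**2).sum(-1)    # pairwise squared distances
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, y, lam=0.1, gamma=1.0):
    K = gauss_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)  # alpha

def kernel_ridge_predict(X_train, alpha, X_query, gamma=1.0):
    return gauss_kernel(X_query, X_train, gamma) @ alpha  # sum_i alpha_i K(x_i, x)
```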
Nonlinear Regression                    - 32 -                Marcus Hutter

                Maximum Margin Classifier

        ŵ = arg max_{w:||w||=1} min_i {y_i (w⊤φ(x_i))}   with   y ∈ {−1, 1}




  • Linear boundary for
    φb (x) = x(b) .
  • Boundary is determined by
    Support Vectors
    (circled data points)
  • Margin negative if
    classes not separable.
Nonlinear Regression                  - 33 -   Marcus Hutter

              Sparse Kernel Methods / SVMs
Non-linear boundary for general φ_b(x)
ŵ = Σ_{i=1}^n a_i φ(x_i) for some a.
⇒ f̂(x) = ŵ⊤φ(x) = Σ_{i=1}^n a_i K(x_i, x)
   depends only on φ via the Kernel
   K(x_i, x) = Σ_{b=1}^d φ_b(x_i) φ_b(x).
⇒ Huge time savings if d ≫ n


Example kernels K(x, x′):
 • polynomial (1 + ⟨x, x′⟩)^d,
 • Gaussian exp(−||x − x′||₂²),
 • neural network tanh(⟨x, x′⟩).
Model Assessment & Selection         - 34 -       Marcus Hutter



 4     MODEL ASSESSMENT & SELECTION

       • Example: Polynomial Regression
       • Training=Empirical versus Test=True Error
       • Empirical Model Selection
       • Theoretical Model Selection
       • The Loss Rank Principle for Model Selection
Model Assessment & Selection            - 35 -                  Marcus Hutter

           Example: Polynomial Regression

 • Straight line does not fit data well (large training error)
   high bias ⇒ poor predictive performance

 • High order polynomial fits data perfectly (zero training error)
   high variance (overfitting)
   ⇒ poor prediction too!

 • Reasonable polynomial
   degree d performs well.
   How to select d?
   minimizing training error
   obviously does not work.
Model Assessment & Selection               - 36 -               Marcus Hutter

  Training=Empirical versus Test=True Error
 • Learn functional relation f for data D = {(x1 , y1 ), ..., (xn , yn )}.
 • We can compute the empirical error on past data:
   Err_D(f) = (1/n) Σ_{i=1}^n (y_i − f(x_i))².
 • Assume data D is a sample from some distribution P.
 • We want to know the expected true error on future examples:
   Err_P(f) = E_P[(y − f(x))²].
 • How good an estimate of Err_P(f) is Err_D(f)?
 • Problem: Err_D(f) decreases with increasing model complexity,
   but not Err_P(f).
Model Assessment & Selection              - 37 -   Marcus Hutter

                 Empirical Model Selection
How to select complexity parameter
 • Kernel width a,
 • penalization constant λ,
 • number k of nearest neighbors,
 • the polynomial degree d?
Empirical test-set-based methods:
 Regress on training set and minimize empirical error w.r.t. “complexity”
 parameter (a, λ, k, d) on a separate test-set.
Sophistication: cross-validation, bootstrap, ...
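A sketch of K-fold cross-validation for choosing the polynomial degree d (the 5-fold split, the use of np.polyfit as the regressor, and the names are illustrative):

```python
# K-fold cross-validation over a complexity parameter (here: polynomial degree).
import numpy as np

def cv_error(x, y, degree, folds=5, seed=0):
    idx = np.arange(len(x))
    np.random.default_rng(seed).shuffle(idx)
    errs = []
    for test in np.array_split(idx, folds):
        train = np.setdiff1d(idx, test)
        w = np.polyfit(x[train], y[train], degree)                   # fit on training part
        errs.append(np.mean((np.polyval(w, x[test]) - y[test])**2))  # held-out error
    return np.mean(errs)

# Select the degree with the smallest cross-validated error, e.g.:
# best_d = min(range(1, 10), key=lambda d: cv_error(x, y, d))
```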
Model Assessment & Selection             - 38 -           Marcus Hutter

               Theoretical Model Selection
How to select complexity or flexibility or smoothness parameter:
Kernel width a, penalization constant λ, number k of nearest neighbors,
the polynomial degree d?
For   parametric regression with d parameters:
  •   Bayesian model selection,
  •   Akaike Information Criterion (AIC),
  •   Bayesian Information Criterion (BIC),
  •   Minimum Description Length (MDL),

They all add a penalty proportional to d to the loss.
For non-parametric linear regression:
  • Add trace of on-data regressor = effective # of parameters to loss.
  • Loss Rank Principle (LoRP).
Model Assessment & Selection            - 39 -             Marcus Hutter

The Loss Rank Principle for Model Selection
Let f̂^c_D : X → Y be the (best) regressor of complexity c on data D.
The loss Rank of f̂^c_D is defined as the number of other (fictitious) data
D that are fitted better by f̂^c_D than D is fitted by f̂^c_D.
 • c is small ⇒ f̂^c_D fits D badly ⇒ many other D can be fitted better
   ⇒ Rank is large.
 • c is large ⇒ many D can be fitted well ⇒ Rank is large.
 • c is appropriate ⇒ f̂^c_D fits D well and not too many other D
   can be fitted well ⇒ Rank is small.
     LoRP: Select model complexity c that has minimal loss Rank
Unlike most penalized maximum likelihood variants (AIC,BIC,MDL),
 • LoRP only depends on the regression and the loss function.
 • It works without a stochastic noise model, and
 • is directly applicable to any non-parametric regressor, like kNN.
How to Attack Large Problems        - 40 -            Marcus Hutter



5     HOW TO ATTACK LARGE PROBLEMS

              • Probabilistic Graphical Models (PGM)
              • Trees Models
              • Non-Parametric Learning
              • Approximate (Variational) Inference
              • Sampling Methods
              • Combining Models
              • Boosting
How to Attack Large Problems                - 41 -              Marcus Hutter

      Probabilistic Graphical Models (PGM)
Visualize structure of model and (in)dependence
⇒ faster and more comprehensible algorithms
  • Nodes = random variables.
  • Edges = stochastic dependencies.
  • Bayesian network = directed PGM
  • Markov random field = undirected PGM
Example:
   P(x_1)P(x_2)P(x_3)P(x_4|x_1x_2x_3)P(x_5|x_1x_3)P(x_6|x_4)P(x_7|x_4x_5)
How to Attack Large Problems             - 42 -             Marcus Hutter

Additive Models & Trees & Related Methods
    Generalized additive model: f(x) = α + f_1(x_1) + ... + f_d(x_d)
   Reduces determining f : ℝ^d → ℝ to d one-dimensional functions f_b : ℝ → ℝ

        Classification/decision trees:


                                                   Outlook:
                                                    • PRIM,
                                                    • bump hunting,
                                                    • How to learn
                                                       tree structures.
How to Attack Large Problems             - 43 -              Marcus Hutter

                        Regression Trees
          f(x) = c_b for x ∈ R_b, and c_b = Average[y_i | x_i ∈ R_b]
How to Attack Large Problems           - 44 -              Marcus Hutter

                 Non-Parametric Learning
     = prototype methods = instance-based learning = model-free
Examples:
 • K-means:
   Data clusters around K centers with cluster means µ1 , ...µK .
   Assign xi to closest cluster center.
 • K Nearest neighbors regression (kNN):
   Estimate f (x) by averaging the yi
   for the k xi closest to x:

 • Kernel regression:
   Take a weighted average f̂(x) = Σ_{i=1}^n K(x, x_i) y_i / Σ_{i=1}^n K(x, x_i).
How to Attack Large Problems   - 45 -   Marcus Hutter
How to Attack Large Problems            - 46 -          Marcus Hutter

        Approximate (Variational) Inference
Approximate full distribution P(z) by q(z)       Examples: Gaussians
Popular: Factorized distribution:
q(z) = q_1(z_1) × ... × q_M(z_M).
Measure of fit: Relative entropy
KL(p||q) = ∫ p(z) log(p(z)/q(z)) dz



Red curves: Left minimizes KL(P||q); Middle and Right are the
two local minima of KL(q||P).
How to Attack Large Problems             - 47 -               Marcus Hutter

            Elementary Sampling Methods
                   How to sample from P : Z → [0, 1]?

 • Special sampling algorithms for standard distributions P.

 • Rejection sampling: Sample z uniformly from domain Z,
   but accept sample only with probability ∝ P(z).

 • Importance sampling:
   E[f] = ∫ f(z)p(z) dz ≈ (1/L) Σ_{l=1}^L f(z^l) p(z^l)/q(z^l),
   where the z^l are sampled from q.
   Choose q easy to sample and large where f(z^l)p(z^l) is large.

 • Others: Slice sampling
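An importance-sampling sketch for the estimate above; the target p, proposal q, and test function f below are illustrative choices with a known answer E_p[f] = 2:

```python
# Importance sampling: E_p[f] ~ (1/L) sum_l f(z_l) p(z_l) / q(z_l), z_l ~ q.
import numpy as np

rng = np.random.default_rng(0)
p = lambda z: np.exp(-0.5 * (z - 1.0)**2) / np.sqrt(2 * np.pi)          # target N(1, 1)
q = lambda z: np.exp(-0.5 * (z / 2.0)**2) / (2.0 * np.sqrt(2 * np.pi))  # proposal N(0, 4)
f = lambda z: z**2

z = rng.normal(0.0, 2.0, size=100_000)        # samples from the proposal q
estimate = np.mean(f(z) * p(z) / q(z))
print(estimate)                               # approx E_p[z^2] = 1 + 1^2 = 2
```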
How to Attack Large Problems            - 48 -              Marcus Hutter

Markov Chain Monte Carlo (MCMC) Sampling
Metropolis: Choose some convenient
q with q(z|z′) = q(z′|z).
Sample z^(l+1) from q(·|z^l) but
accept only with probability
min{1, p(z^(l+1))/p(z^l)}.

Gibbs Sampling: Metropolis with
q leaving z unchanged from l to
l + 1, except resample coordinate
i from P(z_i | z_{\i}) (all other coordinates fixed).

    Green lines are accepted and red lines are rejected Metropolis steps.
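A random-walk Metropolis sketch for a one-dimensional (unnormalized) target density; the target, proposal width, and chain length are illustrative:

```python
# Metropolis sampling with a symmetric Gaussian proposal q(z'|z) = q(z|z').
import numpy as np

rng = np.random.default_rng(0)
p = lambda z: np.exp(-0.5 * z**2) + 0.5 * np.exp(-0.5 * (z - 4.0)**2)  # unnormalized

z, samples = 0.0, []
for _ in range(50_000):
    z_new = z + rng.normal(scale=1.0)                 # propose z^(l+1) from q(.|z^l)
    if rng.uniform() < min(1.0, p(z_new) / p(z)):     # accept with prob min{1, p'/p}
        z = z_new                                     # otherwise keep the old state
    samples.append(z)
```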
How to Attack Large Problems              - 49 -              Marcus Hutter

                       Combining Models
                 Performance can often be improved
             by combining multiple models in some way,
           instead of just using a single model in isolation

 • Committees: Average the predictions of a set of individual models.
 • Bayesian model averaging: P(x) = Σ_Models P(x|Model) P(Model)
 • Mixture models: P(x|θ, π) = Σ_k π_k P_k(x|θ_k)
 • Decision tree: Each model is responsible
   for a different part of the input space.
 • Boosting: Train many weak classifiers in sequence
   and combine output to produce a powerful committee.
How to Attack Large Problems               - 50 -               Marcus Hutter

                               Boosting
Idea: Train many weak classifiers Gm in sequence and
combine output to produce a powerful committee G.
AdaBoost.M1: [Freund & Schapire (1997), received the famous Gödel Prize]

Initialize observation weights w_i uniformly.
For m = 1 to M do:
(a) G_m classifies x as G_m(x) ∈ {−1, 1}.
    Train G_m weighting data i with w_i.
(b) Give G_m high/low weight α_m if it performed well/badly.
(c) Increase attention = weight w_i for obs. x_i misclassified by G_m.
Output weighted majority vote: G(x) = sign(Σ_{m=1}^M α_m G_m(x))
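A sketch of boosting with decision stumps as the weak learners G_m; it uses one common exponential-loss weight update rather than the exact AdaBoost.M1 constants, and the data layout and helper names are made up:

```python
# Boosting sketch: weighted stumps combined into G(x) = sign(sum_m alpha_m G_m(x)).
import numpy as np

def fit_stump(X, y, w):
    best = (np.inf, 0, 0.0, 1)              # (weighted error, feature, threshold, sign)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = s * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(w * (pred != y))
                if err < best[0]:
                    best = (err, j, thr, s)
    return best

def boost(X, y, M=20):                      # y in {-1, +1}
    n = len(y)
    w = np.full(n, 1.0 / n)                 # uniform observation weights
    stumps, alphas = [], []
    for _ in range(M):
        err, j, thr, s = fit_stump(X, y, w)                  # train G_m on weighted data
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))    # high weight if err is small
        pred = s * np.where(X[:, j] > thr, 1, -1)
        w *= np.exp(-alpha * y * pred)                       # boost misclassified x_i
        w /= w.sum()
        stumps.append((j, thr, s))
        alphas.append(alpha)

    def G(Xq):                              # weighted majority vote
        votes = sum(a * s * np.where(Xq[:, j] > thr, 1, -1)
                    for a, (j, thr, s) in zip(alphas, stumps))
        return np.sign(votes)
    return G
```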
Unsupervised Learning             - 51 -              Marcus Hutter



          6     UNSUPERVISED LEARNING

               • K-Means Clustering
               • Mixture Models
               • Expectation Maximization Algorithm
Unsupervised Learning                 - 52 -     Marcus Hutter

                    Unsupervised Learning
Supervised learning: Find functional relation-
ship f : X → Y from I/O data {(xi , yi )}

Unsupervised learning:
Find pattern in data (x1 , ..., xn )
without an explicit teacher (no y values).

Example: Clustering e.g. by K-means

Implicit goal: Find simple explanation,
i.e. compress data (MDL, Occam’s razor).

Density estimation: From which probability
distribution P are the xi drawn?
Unsupervised Learning                - 53 -              Marcus Hutter

               K-means Clustering and EM
 • Data points seem to cluster around two centers.
 • Assign each data point i to a cluster ki .
 • Let µk = center of cluster k.
 • Distortion measure: Total squared distance
   of data points from cluster centers:
   J(k, µ) := Σ_{i=1}^n ||x_i − µ_{k_i}||²
 • Choose centers µk initially at random.
 • M-step: Minimize J w.r.t. k:
   Assign each point to closest center
 • E-step: Minimize J w.r.t. µ:
   Let µk be the mean of points belonging to cluster k
 • Iteration of M-step and E-step converges to local minimum of J.
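A K-means sketch that alternates the two steps above until the centers stop moving (random initialization as on the slide; it assumes no cluster ever becomes empty):

```python
# K-means: alternate assignment to the closest center and recomputation of the means.
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]        # random initial centers
    for _ in range(iters):
        d2 = ((X[:, None, :] - mu[None, :, :])**2).sum(-1)   # squared distances to centers
        k = d2.argmin(axis=1)                                # assign to closest center
        new_mu = np.array([X[k == j].mean(axis=0) for j in range(K)])  # cluster means
        if np.allclose(new_mu, mu):                          # J no longer decreases
            break
        mu = new_mu
    return k, mu
```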
Unsupervised Learning   - 54 -       Marcus Hutter

       Iterations of K-means EM Algorithm
Unsupervised Learning                - 55 -    Marcus Hutter

                  Mixture Models and EM
Mixture of Gaussians:
P(x|π, µ, Σ) = Σ_{k=1}^K Gauss(x|µ_k, Σ_k) π_k
Maximize likelihood P(x|...) w.r.t. π, µ, Σ.


E-Step: Compute proba-
bility γik that data point i
belongs to cluster k, based
on estimates π, µ, Σ.

M-Step: Re-estimate π,
µ, Σ (take empirical
mean/variance) given γik .
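An EM sketch for the Gaussian mixture above (it uses scipy's Gaussian density; the initialization, the small diagonal jitter added to Σ_k, and the iteration count are illustrative):

```python
# EM for a mixture of Gaussians: E-step computes gamma_ik, M-step re-estimates pi, mu, Sigma.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, size=K, replace=False)]
    Sigma = np.array([np.eye(d) for _ in range(K)])
    for _ in range(iters):
        # E-step: responsibility gamma[i, k] that point i belongs to cluster k
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted empirical estimates of pi, mu, Sigma
        Nk = gamma.sum(axis=0)
        pi = Nk / n
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, gamma
```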
Non-IID: Sequential & (Re)Active Settings   - 56 -   Marcus Hutter



7     NON-IID: SEQUENTIAL & (RE)ACTIVE
                  SETTINGS

               • Sequential Data
               • Sequential Decision Theory
               • Learning Agents
               • Reinforcement Learning (RL)
               • Learning in Games
Non-IID: Sequential & (Re)Active Settings        - 57 -             Marcus Hutter

                          Sequential Data
General stochastic time-series: P(x_1...x_n) = Π_{i=1}^n P(x_i|x_1...x_{i−1})
Independent identically distributed (roulette, dice, classification):
P(x_1...x_n) = P(x_1)P(x_2)...P(x_n)
First order Markov chain (Backgammon):
P(x_1...x_n) = P(x_1)P(x_2|x_1)P(x_3|x_2)...P(x_n|x_{n−1})
Second order Markov chain (mechanical systems):
P(x_1...x_n) = P(x_1)P(x_2|x_1)P(x_3|x_1x_2) × ... × P(x_n|x_{n−1}x_{n−2})
Hidden Markov model (speech recognition):
P(x_1...x_n) = ∫ P(z_1)P(z_2|z_1)...P(z_n|z_{n−1}) × Π_{i=1}^n P(x_i|z_i) dz
Non-IID: Sequential & (Re)Active Settings         - 58 -           Marcus Hutter

                Sequential Decision Theory
Setup:
    For t = 1, 2, 3, 4, ...
    Given sequence x_1, x_2, ..., x_{t−1}
    (1) predict/make decision y_t,
    (2) observe x_t,
    (3) suffer loss Loss(x_t, y_t),
    (4) t → t + 1, goto (1)

Example: Weather Forecasting
    x_t ∈ X = {sunny, rainy}
    y_t ∈ Y = {umbrella, sunglasses}

       Loss       | sunny | rainy
      umbrella    |  0.1  |  0.3
      sunglasses  |  0.0  |  1.0

Goal: Minimize expected Loss: ŷ_t = arg min_{y_t} E[Loss(x_t, y_t)|x_1...x_{t−1}]
Important: Greedy minimization of expected loss is optimal if
decision y_t does not influence the environment (future observations).
Examples: Loss = square / absolute / 0-1 error function,
          ŷ = mean / median / mode of P(x_t | · · ·)
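A small sketch of the decision rule above, using the weather loss table; the predictive probability p_rain stands for whatever the learned model P(x_t | x_1...x_{t−1}) assigns to rain:

```python
# Greedy expected-loss minimization for the weather example above.
loss = {("umbrella", "sunny"): 0.1, ("umbrella", "rainy"): 0.3,
        ("sunglasses", "sunny"): 0.0, ("sunglasses", "rainy"): 1.0}

def decide(p_rain):
    def expected_loss(y):
        return (1 - p_rain) * loss[(y, "sunny")] + p_rain * loss[(y, "rainy")]
    return min(("umbrella", "sunglasses"), key=expected_loss)

print(decide(0.1))   # sunglasses: expected loss 0.10 vs 0.12 for the umbrella
print(decide(0.5))   # umbrella:   expected loss 0.20 vs 0.50 for the sunglasses
```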
Non-IID: Sequential & (Re)Active Settings         - 59 -           Marcus Hutter

                        Learning Agents 1
[Figure: agent and environment. The agent receives percepts from the
environment through its sensors and acts on the environment through its actuators.]

                       Additional complication:
                  Learner can influence environment,
                   and hence what he observes next.
          ⇒ farsightedness, planning, exploration necessary.
                    Exploration ⇔ Exploitation problem
Non-IID: Sequential & (Re)Active Settings         - 60 -   Marcus Hutter

                         Learning Agents 2
[Figure: learning agent architecture. A Critic compares sensor feedback against a
performance standard and gives feedback to the Learning element; the Learning element
sends changes to the Performance element and receives knowledge from it; a Problem
generator driven by learning goals proposes exploratory actions; the agent interacts
with the Environment through Sensors and Actuators.]
Non-IID: Sequential & (Re)Active Settings               - 61 -          Marcus Hutter

              Reinforcement Learning (RL)
[Figure: agent/environment interaction. The environment sends rewards and
observations r_1|o_1, r_2|o_2, r_3|o_3, ... to the agent, and the agent sends
actions y_1, y_2, y_3, ... back to the environment.]


  • RL is concerned with how an agent ought to take actions
    in an environment so as to maximize its long-term reward.
  • Find policy that maps states of the world
    to the actions the agent ought to take in those states.
  • The environment is typically formulated
    as a finite-state Markov decision process (MDP).
Non-IID: Sequential & (Re)Active Settings   - 62 -   Marcus Hutter

                        Learning in Games
  • Learning though self-play.
  • Backgammon (TD-Gammon).
  • Samuel’s checker program.
Summary                 - 63 -                 Marcus Hutter



                  8     SUMMARY

          • Important Loss Functions
          • Learning Algorithm Characteristics Comparison
          • More Learning
          • Data Sets
          • Journals
          • Annual Conferences
          • Recommended Literature
Summary            - 64 -            Marcus Hutter

          Important Loss Functions
Summary            - 65 -           Marcus Hutter

Learning Algorithm Characteristics Comparison
Summary                     - 66 -                Marcus Hutter

                         More Learning
 • Concept Learning
 • Bayesian Learning
 • Computational Learning Theory (PAC learning)
 • Genetic Algorithms
 • Learning Sets of Rules
 • Analytical Learning
Summary                       - 67 -                      Marcus Hutter

                          Data Sets
 • UCI Repository:
   http://www.ics.uci.edu/~mlearn/MLRepository.html

 • UCI KDD Archive:
   http://kdd.ics.uci.edu/summary.data.application.html

 • Statlib:
   http://lib.stat.cmu.edu/

 • Delve:
   http://www.cs.utoronto.ca/~delve/
Summary                       - 68 -                    Marcus Hutter

                            Journals
 • Journal of Machine Learning Research
 • Machine Learning
 • IEEE Transactions on Pattern Analysis and Machine Intelligence
 • Neural Computation
 • Neural Networks
 • IEEE Transactions on Neural Networks
 • Annals of Statistics
 • Journal of the American Statistical Association
 • ...
Summary                        - 69 -                      Marcus Hutter

                   Annual Conferences
 • Algorithmic Learning Theory (ALT)
 • Computational Learning Theory (COLT)
 • Uncertainty in Artificial Intelligence (UAI)
 • Neural Information Processing Systems (NIPS)
 • European Conference on Machine Learning (ECML)
 • International Conference on Machine Learning (ICML)
 • International Joint Conference on Artificial Intelligence (IJCAI)
 • International Conference on Artificial Neural Networks (ICANN)
Summary                          - 70 -                         Marcus Hutter

                 Recommended Literature

[Bis06]   C. M. Bishop. Pattern Recognition and Machine Learning.
          Springer, 2006.

[HTF01] T. Hastie, R. Tibshirani, and J. H. Friedman.
        The Elements of Statistical Learning. Springer, 2001.

[Alp04]   E. Alpaydin. Introduction to Machine Learning.
          MIT Press, 2004.

[Hut05] M. Hutter. Universal Artificial Intelligence:
        Sequential Decisions based on Algorithmic Probability.
        Springer, Berlin, 2005. http://www.hutter1.net/ai/uaibook.htm.

Weitere ähnliche Inhalte

Was ist angesagt?

Statistics vs machine learning
Statistics vs machine learningStatistics vs machine learning
Statistics vs machine learningTom Dierickx
 
Optimization for Deep Learning
Optimization for Deep LearningOptimization for Deep Learning
Optimization for Deep LearningSebastian Ruder
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)Abhimanyu Dwivedi
 
NAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIERNAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIERKnoldus Inc.
 
What is the Expectation Maximization (EM) Algorithm?
What is the Expectation Maximization (EM) Algorithm?What is the Expectation Maximization (EM) Algorithm?
What is the Expectation Maximization (EM) Algorithm?Kazuki Yoshida
 
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Simplilearn
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioMarina Santini
 
Artificial Neural Networks - ANN
Artificial Neural Networks - ANNArtificial Neural Networks - ANN
Artificial Neural Networks - ANNMohamed Talaat
 
Feed forward ,back propagation,gradient descent
Feed forward ,back propagation,gradient descentFeed forward ,back propagation,gradient descent
Feed forward ,back propagation,gradient descentMuhammad Rasel
 
Feedforward neural network
Feedforward neural networkFeedforward neural network
Feedforward neural networkSopheaktra YONG
 
Logistic regression
Logistic regressionLogistic regression
Logistic regressionsaba khan
 
Machine Learning with Decision trees
Machine Learning with Decision treesMachine Learning with Decision trees
Machine Learning with Decision treesKnoldus Inc.
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningLior Rokach
 
Linear regression with gradient descent
Linear regression with gradient descentLinear regression with gradient descent
Linear regression with gradient descentSuraj Parmar
 
2.5 backpropagation
2.5 backpropagation2.5 backpropagation
2.5 backpropagationKrish_ver2
 
Introduction to Deep Learning
Introduction to Deep LearningIntroduction to Deep Learning
Introduction to Deep LearningOswald Campesato
 
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...Simplilearn
 

Was ist angesagt? (20)

Statistics vs machine learning
Statistics vs machine learningStatistics vs machine learning
Statistics vs machine learning
 
Optimization for Deep Learning
Optimization for Deep LearningOptimization for Deep Learning
Optimization for Deep Learning
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)
 
NAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIERNAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIER
 
What is the Expectation Maximization (EM) Algorithm?
What is the Expectation Maximization (EM) Algorithm?What is the Expectation Maximization (EM) Algorithm?
What is the Expectation Maximization (EM) Algorithm?
 
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
Artificial Neural Networks - ANN
Artificial Neural Networks - ANNArtificial Neural Networks - ANN
Artificial Neural Networks - ANN
 
Feed forward ,back propagation,gradient descent
Feed forward ,back propagation,gradient descentFeed forward ,back propagation,gradient descent
Feed forward ,back propagation,gradient descent
 
Feedforward neural network
Feedforward neural networkFeedforward neural network
Feedforward neural network
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
Naive bayes
Naive bayesNaive bayes
Naive bayes
 
Machine Learning with Decision trees
Machine Learning with Decision treesMachine Learning with Decision trees
Machine Learning with Decision trees
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Linear regression with gradient descent
Linear regression with gradient descentLinear regression with gradient descent
Linear regression with gradient descent
 
2.5 backpropagation
2.5 backpropagation2.5 backpropagation
2.5 backpropagation
 
Introduction to Deep Learning
Introduction to Deep LearningIntroduction to Deep Learning
Introduction to Deep Learning
 
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
 

Ähnlich wie Introduction to Statistical Machine Learning

Lecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language TechnologyLecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language TechnologyMarina Santini
 
Uncertainty Awareness in Integrating Machine Learning and Game Theory
Uncertainty Awareness in Integrating Machine Learning and Game TheoryUncertainty Awareness in Integrating Machine Learning and Game Theory
Uncertainty Awareness in Integrating Machine Learning and Game TheoryRikiya Takahashi
 
Foundations of Machine Learning
Foundations of Machine LearningFoundations of Machine Learning
Foundations of Machine Learningmahutte
 
Predictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big dataPredictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big dataArthur Charpentier
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401butest
 
Fundementals of Machine Learning and Deep Learning
Fundementals of Machine Learning and Deep Learning Fundementals of Machine Learning and Deep Learning
Fundementals of Machine Learning and Deep Learning ParrotAI
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401butest
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401butest
 
Machine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptxMachine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptxVenkateswaraBabuRavi
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningAI Summary
 
DeepLearningLecture.pptx
DeepLearningLecture.pptxDeepLearningLecture.pptx
DeepLearningLecture.pptxssuserf07225
 
know Machine Learning Basic Concepts.pdf
know Machine Learning Basic Concepts.pdfknow Machine Learning Basic Concepts.pdf
know Machine Learning Basic Concepts.pdfhemangppatel
 
060 techniques of_data_analysis
060 techniques of_data_analysis060 techniques of_data_analysis
060 techniques of_data_analysisNouman Zia
 
MS CS - Selecting Machine Learning Algorithm
MS CS - Selecting Machine Learning AlgorithmMS CS - Selecting Machine Learning Algorithm
MS CS - Selecting Machine Learning AlgorithmKaniska Mandal
 
Machine learning and linear regression programming
Machine learning and linear regression programmingMachine learning and linear regression programming
Machine learning and linear regression programmingSoumya Mukherjee
 

Ähnlich wie Introduction to Statistical Machine Learning (20)

Ml ppt at
Ml ppt atMl ppt at
Ml ppt at
 
Lecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language TechnologyLecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language Technology
 
Uncertainty Awareness in Integrating Machine Learning and Game Theory
Uncertainty Awareness in Integrating Machine Learning and Game TheoryUncertainty Awareness in Integrating Machine Learning and Game Theory
Uncertainty Awareness in Integrating Machine Learning and Game Theory
 
Foundations of Machine Learning
Foundations of Machine LearningFoundations of Machine Learning
Foundations of Machine Learning
 
Predictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big dataPredictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big data
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401
 
Fundementals of Machine Learning and Deep Learning
Fundementals of Machine Learning and Deep Learning Fundementals of Machine Learning and Deep Learning
Fundementals of Machine Learning and Deep Learning
 
Machine learning
Machine learningMachine learning
Machine learning
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401
 
Machine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptxMachine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptx
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
ML unit-1.pptx
ML unit-1.pptxML unit-1.pptx
ML unit-1.pptx
 
DeepLearningLecture.pptx
DeepLearningLecture.pptxDeepLearningLecture.pptx
DeepLearningLecture.pptx
 
Isolation Forest
Isolation ForestIsolation Forest
Isolation Forest
 
know Machine Learning Basic Concepts.pdf
know Machine Learning Basic Concepts.pdfknow Machine Learning Basic Concepts.pdf
know Machine Learning Basic Concepts.pdf
 
060 techniques of_data_analysis
060 techniques of_data_analysis060 techniques of_data_analysis
060 techniques of_data_analysis
 
MS CS - Selecting Machine Learning Algorithm
MS CS - Selecting Machine Learning AlgorithmMS CS - Selecting Machine Learning Algorithm
MS CS - Selecting Machine Learning Algorithm
 
XGBoostLSS - An extension of XGBoost to probabilistic forecasting, Alexander ...
XGBoostLSS - An extension of XGBoost to probabilistic forecasting, Alexander ...XGBoostLSS - An extension of XGBoost to probabilistic forecasting, Alexander ...
XGBoostLSS - An extension of XGBoost to probabilistic forecasting, Alexander ...
 
Machine learning and linear regression programming
Machine learning and linear regression programmingMachine learning and linear regression programming
Machine learning and linear regression programming
 

Kürzlich hochgeladen

Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Association for Project Management
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptshraddhaparab530
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Projectjordimapav
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...Nguyen Thanh Tu Collection
 
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxCLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxAnupam32727
 
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxBIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxSayali Powar
 
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...DhatriParmar
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmStan Meyer
 
4.9.24 Social Capital and Social Exclusion.pptx
4.9.24 Social Capital and Social Exclusion.pptx4.9.24 Social Capital and Social Exclusion.pptx
4.9.24 Social Capital and Social Exclusion.pptxmary850239
 
4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptx4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptxmary850239
 
Employablity presentation and Future Career Plan.pptx
Employablity presentation and Future Career Plan.pptxEmployablity presentation and Future Career Plan.pptx
Employablity presentation and Future Career Plan.pptxryandux83rd
 
Sulphonamides, mechanisms and their uses
Sulphonamides, mechanisms and their usesSulphonamides, mechanisms and their uses
Sulphonamides, mechanisms and their usesVijayaLaxmi84
 
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDecoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDhatriParmar
 
Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...
Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...
Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...Osopher
 
An Overview of the Calendar App in Odoo 17 ERP
An Overview of the Calendar App in Odoo 17 ERPAn Overview of the Calendar App in Odoo 17 ERP
An Overview of the Calendar App in Odoo 17 ERPCeline George
 
Congestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationCongestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationdeepaannamalai16
 
Comparative Literature in India by Amiya dev.pptx
Comparative Literature in India by Amiya dev.pptxComparative Literature in India by Amiya dev.pptx
Comparative Literature in India by Amiya dev.pptxAvaniJani1
 

Kürzlich hochgeladen (20)

Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.ppt
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...
BÀI TẬP BỔ TRỢ TIẾNG ANH 11 THEO ĐƠN VỊ BÀI HỌC - CẢ NĂM - CÓ FILE NGHE (GLOB...
 
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxCLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
 
Mattingly "AI & Prompt Design" - Introduction to Machine Learning"
Mattingly "AI & Prompt Design" - Introduction to Machine Learning"Mattingly "AI & Prompt Design" - Introduction to Machine Learning"
Mattingly "AI & Prompt Design" - Introduction to Machine Learning"
 
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxBIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
 
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and Film
 
4.9.24 Social Capital and Social Exclusion.pptx
4.9.24 Social Capital and Social Exclusion.pptx4.9.24 Social Capital and Social Exclusion.pptx
4.9.24 Social Capital and Social Exclusion.pptx
 
4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptx4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptx
 
Chi-Square Test Non Parametric Test Categorical Variable
Chi-Square Test Non Parametric Test Categorical VariableChi-Square Test Non Parametric Test Categorical Variable
Chi-Square Test Non Parametric Test Categorical Variable
 
Employablity presentation and Future Career Plan.pptx
Employablity presentation and Future Career Plan.pptxEmployablity presentation and Future Career Plan.pptx
Employablity presentation and Future Career Plan.pptx
 
Sulphonamides, mechanisms and their uses
Sulphonamides, mechanisms and their usesSulphonamides, mechanisms and their uses
Sulphonamides, mechanisms and their uses
 
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDecoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
 
Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...
Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...
Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...
 
An Overview of the Calendar App in Odoo 17 ERP
An Overview of the Calendar App in Odoo 17 ERPAn Overview of the Calendar App in Odoo 17 ERP
An Overview of the Calendar App in Odoo 17 ERP
 
Congestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationCongestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentation
 
Comparative Literature in India by Amiya dev.pptx
Comparative Literature in India by Amiya dev.pptxComparative Literature in India by Amiya dev.pptx
Comparative Literature in India by Amiya dev.pptx
 

Introduction to Statistical Machine Learning

  • 7. Intro/Overview/Preliminaries -7- Marcus Hutter Handwritten Character Recognition an example of a difficult machine learning problem Task: Learn general mapping from pixel images to digits from examples
  • 8. Intro/Overview/Preliminaries -8- Marcus Hutter Applications of Machine Learning machine learning has a wide spectrum of applications including: • natural language processing, • search engines, • medical diagnosis, • detecting credit card fraud, • stock market analysis, • bio-informatics, e.g. classifying DNA sequences, • speech and handwriting recognition, • object recognition in computer vision, • playing games – learning by self-play: Checkers, Backgammon. • robot locomotion.
  • 9. Intro/Overview/Preliminaries -9- Marcus Hutter Some Fundamental Types of Learning • Supervised Learning: Classification, Regression • Unsupervised Learning: Association, Clustering, Density Estimation • Reinforcement Learning: Agents • Others: SemiSupervised Learning, Active Learning
  • 10. Intro/Overview/Preliminaries - 10 - Marcus Hutter Supervised Learning • Prediction of future cases: Use the rule to predict the output for future inputs • Knowledge extraction: The rule is easy to understand • Compression: The rule is simpler than the data it explains • Outlier detection: Exceptions that are not covered by the rule, e.g., fraud
  • 11. Intro/Overview/Preliminaries - 11 - Marcus Hutter Classification Example: Credit scoring Differentiating between low-risk and high-risk customers from their Income and Savings Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
  • 12. Intro/Overview/Preliminaries - 12 - Marcus Hutter Regression Example: Price y = f (x)+noise of a used car as function of age x
  • 13. Intro/Overview/Preliminaries - 13 - Marcus Hutter Unsupervised Learning • Learning “what normally happens” • No output • Clustering: Grouping similar instances • Example applications: Customer segmentation in CRM Image compression: Color quantization Bioinformatics: Learning motifs
  • 14. Intro/Overview/Preliminaries - 14 - Marcus Hutter Reinforcement Learning • Learning a policy: A sequence of outputs • No supervised output but delayed reward • Credit assignment problem • Game playing • Robot in a maze • Multiple agents, partial observability, ...
  • 15. Intro/Overview/Preliminaries - 15 - Marcus Hutter Dichotomies in Machine Learning:
scope of my lecture ⇔ scope of other lectures
(machine) learning / statistical ⇔ logic/knowledge-based (GOFAI)
induction ⇔ prediction ⇔ decision ⇔ action
regression ⇔ classification
independent identically distributed ⇔ sequential / non-iid
online learning ⇔ offline/batch learning
passive prediction ⇔ active learning
parametric ⇔ non-parametric
conceptual/mathematical ⇔ computational issues
exact/principled ⇔ heuristic
supervised learning ⇔ unsupervised ⇔ reinforcement learning (RL)
  • 16. Intro/Overview/Preliminaries - 16 - Marcus Hutter Probability Basics
Probability is used to describe uncertain events; the chance or belief that something is or will be true.
Example: Fair Six-Sided Die:
• Sample space: Ω = {1, 2, 3, 4, 5, 6}
• Events: Even = {2, 4, 6}, Odd = {1, 3, 5} ⊆ Ω
• Probability: P(6) = 1/6, P(Even) = P(Odd) = 1/2
• Outcome: 6 ∈ Even.
• Conditional probability: P(6|Even) = P(6 and Even)/P(Even) = (1/6)/(1/2) = 1/3
General Axioms:
• P({}) = 0 ≤ P(A) ≤ 1 = P(Ω)
• P(A ∪ B) + P(A ∩ B) = P(A) + P(B)
• P(A ∩ B) = P(A|B) P(B)
  • 17. Intro/Overview/Preliminaries - 17 - Marcus Hutter Probability Jargon
Example: (Un)fair coin: Ω = {Tail, Head} ≅ {0, 1}. P(1) = θ ∈ [0, 1]:
Likelihood: P(1101|θ) = θ × θ × (1 − θ) × θ = θ³(1 − θ)
Maximum Likelihood (ML) estimate: θ̂ = arg max_θ P(1101|θ) = 3/4
Prior: If we are indifferent, then P(θ) = const.
Evidence: P(1101) = Σ_θ P(1101|θ) P(θ) = 1/20 (actually an integral over θ)
Posterior: P(θ|1101) = P(1101|θ) P(θ) / P(1101) ∝ θ³(1 − θ) (BAYES RULE!)
Maximum a Posteriori (MAP) estimate: θ̂ = arg max_θ P(θ|1101) = 3/4
Predictive distribution: P(1|1101) = P(11011)/P(1101) = 2/3
Expectation: E[f |...] = Σ_θ f(θ) P(θ|...), e.g. E[θ|1101] = 2/3
Variance: Var(θ) = E[(θ − Eθ)²|1101] = 2/63
Probability density: P(θ) = (1/ε) P([θ, θ + ε]) for ε → 0
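To make these numbers concrete, here is a small NumPy sketch (not from the slides) that recovers the ML/MAP estimates, evidence, predictive probability, and posterior moments for the sequence 1101 under a uniform prior by numerical integration on a grid; the grid resolution is an arbitrary choice.

    import numpy as np

    theta = np.linspace(0.0, 1.0, 100001)      # grid over the coin bias
    dtheta = theta[1] - theta[0]
    likelihood = theta**3 * (1 - theta)        # P(1101|theta) = theta^3 (1 - theta)
    prior = np.ones_like(theta)                # uniform prior

    theta_ml = theta[np.argmax(likelihood)]               # ~ 3/4
    evidence = np.sum(likelihood * prior) * dtheta        # ~ 1/20
    posterior = likelihood * prior / evidence             # Bayes rule
    theta_map = theta[np.argmax(posterior)]               # ~ 3/4
    predictive = np.sum(theta * posterior) * dtheta       # P(1|1101) = E[theta|1101] ~ 2/3
    variance = np.sum((theta - predictive)**2 * posterior) * dtheta   # ~ 2/63

    print(theta_ml, evidence, theta_map, predictive, variance)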
  • 18. Linear Methods for Regression - 18 - Marcus Hutter 2 LINEAR METHODS FOR REGRESSION • Linear Regression • Coefficient Subset Selection • Coefficient Shrinkage • Linear Methods for Classification • Linear Basis Function Regression (LBFR) • Piecewise linear, Splines, Wavelets • Local Smoothing & Kernel Regression • Regularization & 1D Smoothing Splines
  • 19. Linear Methods for Regression - 19 - Marcus Hutter Linear Regression: fitting a linear function to the data
• Input “feature” vector x := (1 ≡ x^(0), x^(1), ..., x^(d)) ∈ ℝ^(d+1)
• Real-valued noisy response y ∈ ℝ.
• Linear regression model: ŷ = f_w(x) = w_0 x^(0) + ... + w_d x^(d)
• Data: D = (x_1, y_1), ..., (x_n, y_n)
• Error or loss function: Example: residual sum of squares: Loss(w) = Σ_{i=1}^n (y_i − f_w(x_i))²
• Least squares (LSQ) regression: ŵ = arg min_w Loss(w)
• Example: Person’s weight y as a function of age x_1 and height x_2.
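A minimal NumPy sketch of least squares regression on the weight-from-age-and-height example; the data and coefficients below are synthetic, invented purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    age = rng.uniform(20, 60, n)
    height = rng.uniform(150, 200, n)
    y = 0.3 * age + 0.5 * height - 40 + rng.normal(0, 5, n)   # noisy synthetic response

    X = np.column_stack([np.ones(n), age, height])   # x = (1, x^(1), x^(2)), constant feature first
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)    # arg min_w sum_i (y_i - f_w(x_i))^2
    rss = np.sum((y - X @ w_hat) ** 2)               # residual sum of squares at the optimum
    print(w_hat, rss)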
  • 20. Linear Methods for Regression - 20 - Marcus Hutter Coefficient Subset Selection Problems with least squares regression if d is large: • Overfitting: The plane fits the data well (perfect for d ≥ n), but predicts (generalizes) badly. • Interpretation: We want to identify a small subset of features important/relevant for predicting y. Solution 1: Subset selection: Take those k out of d features that minimize the LSQ error.
  • 21. Linear Methods for Regression - 21 - Marcus Hutter Coefficient Shrinkage Solution 2: Shrinkage methods: Shrink the least squares ŵ by penalizing the Loss: Ridge regression: add a penalty ∝ ||w||₂². Lasso: add a penalty ∝ ||w||₁. Bayesian linear regression: compute the MAP estimate arg max_w P(w|D) from prior P(w) and sampling model P(D|w). Weights of low-variance components shrink most.
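For comparison with plain least squares, a rough sketch of ridge regression in closed form; the value of lam and the toy data are arbitrary, and for brevity the intercept is penalized along with the other weights.

    import numpy as np

    def ridge(X, y, lam):
        # w = arg min ||y - Xw||^2 + lam * ||w||_2^2  (closed-form solution)
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 10))
    y = X @ rng.normal(size=10) + rng.normal(0, 0.1, 50)
    print(ridge(X, y, lam=1.0))     # larger lam shrinks the weights more strongly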
  • 22. Linear Methods for Regression - 22 - Marcus Hutter Linear Methods for Classification Example: Y = {spam, non-spam} ≅ {−1, 1} (or {0, 1}). Reduction to regression: regard y ∈ ℝ ⇒ ŵ from linear regression. Binary classification: if f_ŵ(x) > 0 then ŷ = 1 else ŷ = −1. Probabilistic classification: predict the probability that a new x is in class y; log-odds log[P(y=1|x,D)/P(y=0|x,D)] := f_ŵ(x). Improvements: • Linear Discriminant Analysis (LDA) • Logistic Regression • Perceptron • Maximum Margin Hyperplane • Support Vector Machine. Generalization to non-binary Y possible.
  • 23. Linear Methods for Regression - 23 - Marcus Hutter Linear Basis Function Regression (LBFR) = powerful generalization of and reduction to linear regression! Problem: Response y ∈ ℝ is often not linear in x ∈ ℝ^d. Solution: Transform x ↦ φ(x) with φ : ℝ^d → ℝ^p. Assume/hope response y ∈ ℝ is now linear in φ. Examples: • Linear regression: p = d and φ_i(x) = x_i. • Polynomial regression: d = 1 and φ_i(x) = x^i. • Piecewise constant regression: e.g. d = 1 with φ_i(x) = 1 for i ≤ x < i + 1 and 0 else. • Piecewise polynomials ... • Splines ...
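A short sketch of polynomial basis function regression (d = 1, φ_i(x) = x^i): transform the input and then run ordinary least squares in the transformed features. The target function, degree, and noise level are made up.

    import numpy as np

    def poly_features(x, p):
        # phi(x) = (1, x, x^2, ..., x^p)
        return np.column_stack([x**i for i in range(p + 1)])

    rng = np.random.default_rng(2)
    x = np.sort(rng.uniform(-1, 1, 30))
    y = np.sin(3 * x) + rng.normal(0, 0.1, 30)       # response is not linear in x

    Phi = poly_features(x, p=5)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # linear regression in phi-space
    y_hat = Phi @ w                                  # fitted values
    print(np.mean((y - y_hat) ** 2))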
  • 24. Linear Methods for Regression - 24 - Marcus Hutter
  • 25. Linear Methods for Regression - 25 - Marcus Hutter 2D Spline LBFR and 1D Symmlet-8 Wavelets
  • 26. Linear Methods for Regression - 26 - Marcus Hutter Local Smoothing & Kernel Regression Estimate f(x) by averaging the y_i for all x_i a-close to x: f̂(x) = Σ_{i=1}^n K(x, x_i) y_i / Σ_{i=1}^n K(x, x_i). Nearest-Neighbor Kernel: K(x, x_i) = 1 if |x − x_i| < a and 0 else. Generalization to other K, e.g. quadratic (Epanechnikov) Kernel: K(x, x_i) = max{0, a² − (x − x_i)²}
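As an illustration, a sketch of this kernel smoother with the Epanechnikov kernel from the slide; the bandwidth a and the synthetic data are arbitrary choices.

    import numpy as np

    def kernel_regression(x_query, x, y, a=0.3):
        # Weighted average: f_hat(x) = sum_i K(x, x_i) y_i / sum_i K(x, x_i)
        K = np.maximum(0.0, a**2 - (x_query[:, None] - x[None, :])**2)   # Epanechnikov kernel
        K_sum = np.maximum(K.sum(axis=1), 1e-12)    # guard against empty neighborhoods
        return (K * y).sum(axis=1) / K_sum

    rng = np.random.default_rng(3)
    x = np.sort(rng.uniform(0, 1, 100))
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 100)
    x_query = np.linspace(0, 1, 50)
    print(kernel_regression(x_query, x, y)[:5])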
  • 27. Linear Methods for Regression - 27 - Marcus Hutter Regularization & 1D Smoothing Splines: to avoid overfitting if the function class is large. f̂ = arg min_f { Σ_{i=1}^n (y_i − f(x_i))² + λ ∫ (f''(x))² dx } • λ = 0 ⇒ f̂ = any function through the data • λ = ∞ ⇒ f̂ = least squares line fit • 0 < λ < ∞ ⇒ f̂ = piecewise cubic with continuous derivative
  • 28. Nonlinear Regression - 28 - Marcus Hutter 3 NONLINEAR REGRESSION • Artificial Neural Networks • Kernel Trick • Maximum Margin Classifier • Sparse Kernel Methods / SVMs
  • 29. Nonlinear Regression - 29 - Marcus Hutter Artificial Neural Networks 1 as non-linear function approximators. Single hidden layer feed-forward neural network: • Hidden layer: z_j = h(Σ_{i=0}^d w^(1)_ji x_i) • Output: f_{w,k}(x) = σ(Σ_{j=0}^M w^(2)_kj z_j) • Sigmoidal activation functions h() and σ() ⇒ f non-linear • Goal: Find network weights best modeling the data • Back-propagation algorithm: minimize Σ_{i=1}^n ||y_i − f_w(x_i)||²₂ w.r.t. w by gradient descent.
  • 30. Nonlinear Regression - 30 - Marcus Hutter Artificial Neural Networks 2 • Avoid overfitting by early stopping or small M . • Avoid local minima by random initial weights or stochastic gradient descent. Example: Image Processing
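A rough sketch of a single-hidden-layer network trained by back-propagation / gradient descent on a 1D regression task; the identity is used as the output activation instead of a sigmoid (common for regression), and the architecture, learning rate, and data are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(4)
    x = np.linspace(-2, 2, 200)[:, None]
    y = np.sin(2 * x)                                   # synthetic regression target

    M, lr = 10, 0.05                                    # number of hidden units, learning rate
    W1 = rng.normal(0, 0.5, (1, M)); b1 = np.zeros(M)   # hidden-layer weights w^(1)
    W2 = rng.normal(0, 0.5, (M, 1)); b2 = np.zeros(1)   # output weights w^(2)

    for step in range(5000):
        z = np.tanh(x @ W1 + b1)          # hidden activations z_j = h(sum_i w_ji x_i)
        f = z @ W2 + b2                   # network output
        err = f - y                       # gradient of 0.5*||f - y||^2 w.r.t. f
        gW2 = z.T @ err / len(x); gb2 = err.mean(axis=0)
        dz = (err @ W2.T) * (1 - z**2)    # back-propagate through tanh: tanh' = 1 - tanh^2
        gW1 = x.T @ dz / len(x); gb1 = dz.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

    mse = np.mean((np.tanh(x @ W1 + b1) @ W2 + b2 - y) ** 2)
    print("final MSE:", float(mse))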
  • 31. Nonlinear Regression - 31 - Marcus Hutter Kernel Trick The Kernel-Trick allows reducing the functional minimization to a finite-dimensional optimization. • Let L(y, f(x)) be any loss function • and J(f) be a penalty quadratic in f • then the minimum of the penalized loss Σ_{i=1}^n L(y_i, f(x_i)) + λ J(f) • has the form f(x) = Σ_{i=1}^n α_i K(x, x_i) • with α minimizing Σ_{i=1}^n L(y_i, (Kα)_i) + λ αᵀKα • and Kernel K_ij = K(x_i, x_j) following from J.
  • 32. Nonlinear Regression - 32 - Marcus Hutter Maximum Margin Classifier ŵ = arg max_{w:||w||=1} min_i {y_i (wᵀφ(x_i))} with y ∈ {−1, 1} • Linear boundary for φ_b(x) = x^(b). • Boundary is determined by the Support Vectors (circled data points) • Margin negative if classes not separable.
  • 33. Nonlinear Regression - 33 - Marcus Hutter Sparse Kernel Methods / SVMs Non-linear boundary for general φ_b(x): ŵ = Σ_{i=1}^n a_i φ(x_i) for some a. ⇒ f̂(x) = ŵᵀφ(x) = Σ_{i=1}^n a_i K(x_i, x) depends on φ only via the Kernel K(x_i, x) = Σ_{b=1}^d φ_b(x_i) φ_b(x). ⇒ Huge time savings if d ≫ n. Example kernels K(x, x′): • polynomial (1 + ⟨x, x′⟩)^d, • Gaussian exp(−||x − x′||²₂), • neural network tanh(⟨x, x′⟩).
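A sketch of the representer/kernel idea with the squared loss (i.e. kernel ridge regression): the coefficients a solve (K + λI)a = y and predictions only need kernel evaluations. The Gaussian kernel width, λ, and the toy data are arbitrary choices, not part of the slides.

    import numpy as np

    def gaussian_kernel(X1, X2, gamma=1.0):
        # K(x, x') = exp(-gamma * ||x - x'||^2)
        d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * d2)

    rng = np.random.default_rng(5)
    X = rng.uniform(-1, 1, (80, 2))
    y = np.sin(X[:, 0]) + X[:, 1] ** 2

    K = gaussian_kernel(X, X)
    alpha = np.linalg.solve(K + 0.1 * np.eye(len(X)), y)   # a = (K + lam I)^-1 y for squared loss

    X_new = rng.uniform(-1, 1, (5, 2))
    f_new = gaussian_kernel(X_new, X) @ alpha              # f(x) = sum_i a_i K(x_i, x)
    print(f_new)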
  • 34. Model Assessment & Selection - 34 - Marcus Hutter 4 MODEL ASSESSMENT & SELECTION • Example: Polynomial Regression • Training=Empirical versus Test=True Error • Empirical Model Selection • Theoretical Model Selection • The Loss Rank Principle for Model Selection
  • 35. Model Assessment & Selection - 35 - Marcus Hutter Example: Polynomial Regression • Straight line does not fit data well (large training error) high bias ⇒ poor predictive performance • High order polynomial fits data perfectly (zero training error) high variance (overfitting) ⇒ poor prediction too! • Reasonable polynomial degree d performs well. How to select d? minimizing training error obviously does not work.
  • 36. Model Assessment & Selection - 36 - Marcus Hutter Training=Empirical versus Test=True Error • Learn functional relation f for data D = {(x_1, y_1), ..., (x_n, y_n)}. • We can compute the empirical error on past data: Err_D(f) = (1/n) Σ_{i=1}^n (y_i − f(x_i))². • Assume data D is a sample from some distribution P. • We want to know the expected true error on future examples: Err_P(f) = E_P[(y − f(x))²]. • How good an estimate of Err_P(f) is Err_D(f)? • Problem: Err_D(f) decreases with increasing model complexity, but not Err_P(f).
  • 37. Model Assessment & Selection - 37 - Marcus Hutter Empirical Model Selection How to select the complexity parameter • Kernel width a, • penalization constant λ, • number k of nearest neighbors, • the polynomial degree d? Empirical test-set-based methods: Regress on the training set and minimize the empirical error w.r.t. the “complexity” parameter (a, λ, k, d) on a separate test set. Sophistication: cross-validation, bootstrap, ...
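A sketch of empirical (cross-validation based) selection of the polynomial degree d; the fold count, the degrees tried, and the data are all illustrative choices.

    import numpy as np

    def cv_error(x, y, degree, folds=5, seed=0):
        # K-fold cross-validation error of polynomial regression of a given degree
        idx = np.random.default_rng(seed).permutation(len(x))
        chunks = np.array_split(idx, folds)
        errs = []
        for k in range(folds):
            test = chunks[k]
            train = np.concatenate([chunks[j] for j in range(folds) if j != k])
            w = np.polyfit(x[train], y[train], degree)                     # fit on the training part
            errs.append(np.mean((np.polyval(w, x[test]) - y[test]) ** 2))  # held-out error
        return np.mean(errs)

    rng = np.random.default_rng(6)
    x = np.sort(rng.uniform(-1, 1, 60))
    y = np.sin(3 * x) + rng.normal(0, 0.2, 60)
    for d in (1, 3, 5, 9):
        print(d, cv_error(x, y, d))     # pick the degree with the smallest CV error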
  • 38. Model Assessment & Selection - 38 - Marcus Hutter Theoretical Model Selection How to select complexity or flexibility or smoothness parameter: Kernel width a, penalization constant λ, number k of nearest neighbors, the polynomial degree d? For parametric regression with d parameters: • Bayesian model selection, • Akaike Information Criterion (AIC), • Bayesian Information Criterion (BIC), • Minimum Description Length (MDL), They all add a penalty proportional to d to the loss. For non-parametric linear regression: • Add trace of on-data regressor = effective # of parameters to loss. • Loss Rank Principle (LoRP).
  • 39. Model Assessment & Selection - 39 - Marcus Hutter The Loss Rank Principle for Model Selection Let f̂^c_D : X → Y be the (best) regressor of complexity c on data D. The loss Rank of f̂^c_D is defined as the number of other (fictitious) data D′ that are fitted better by f̂^c_{D′} than D is fitted by f̂^c_D. • c is small ⇒ f̂^c_D fits D badly ⇒ many other D′ can be fitted better ⇒ Rank is large. • c is large ⇒ many D′ can be fitted well ⇒ Rank is large. • c is appropriate ⇒ f̂^c_D fits D well and not too many other D′ can be fitted well ⇒ Rank is small. LoRP: Select the model complexity c that has minimal loss Rank. Unlike most penalized maximum likelihood variants (AIC, BIC, MDL), • LoRP only depends on the regression and the loss function, • it works without a stochastic noise model, and • is directly applicable to any non-parametric regressor, like kNN.
  • 40. How to Attack Large Problems - 40 - Marcus Hutter 5 HOW TO ATTACK LARGE PROBLEMS • Probabilistic Graphical Models (PGM) • Tree Models • Non-Parametric Learning • Approximate (Variational) Inference • Sampling Methods • Combining Models • Boosting
  • 41. How to Attack Large Problems - 41 - Marcus Hutter Probabilistic Graphical Models (PGM) Visualize structure of model and (in)dependence ⇒ faster and more comprehensible algorithms • Nodes = random variables. • Edges = stochastic dependencies. • Bayesian network = directed PGM • Markov random field = undirected PGM Example: P(x1 )P(x2 )P(x3 )P(x4 |x1 x2 x3 )P(x5 |x1 x3 )P (x6 |x4 )P(x7 |x4 x5 )
  • 42. How to Attack Large Problems - 42 - Marcus Hutter Additive Models & Trees & Related Methods Generalized additive model: f(x) = α + f_1(x_1) + ... + f_d(x_d). Reduces determining f : ℝ^d → ℝ to d one-dimensional functions f_b : ℝ → ℝ. Classification/decision trees. Outlook: • PRIM, • bump hunting, • how to learn tree structures.
  • 43. How to Attack Large Problems - 43 - Marcus Hutter Regression Trees f (x) = cb for x ∈ Rb , and cb = Average[yi |xi ∈ Rb ]
  • 44. How to Attack Large Problems - 44 - Marcus Hutter Non-Parametric Learning = prototype methods = instance-based learning = model-free. Examples: • K-means: Data clusters around K centers with cluster means µ_1, ..., µ_K. Assign x_i to the closest cluster center. • K Nearest neighbors regression (kNN): Estimate f(x) by averaging the y_i for the k x_i closest to x. • Kernel regression: Take a weighted average f̂(x) = Σ_{i=1}^n K(x, x_i) y_i / Σ_{i=1}^n K(x, x_i).
  • 45. How to Attack Large Problems - 45 - Marcus Hutter
  • 46. How to Attack Large Problems - 46 - Marcus Hutter Approximate (Variational) Inference Approximate the full distribution P(z) by q(z). Examples: Gaussians. Popular: factorized distribution q(z) = q_1(z_1) × ... × q_M(z_M). Measure of fit: relative entropy KL(p||q) = ∫ p(z) log[p(z)/q(z)] dz. Red curves: Left minimizes KL(P||q), Middle and Right are the two local minima of KL(q||P).
  • 47. How to Attack Large Problems - 47 - Marcus Hutter Elementary Sampling Methods How to sample from P : Z → [0, 1]? • Special sampling algorithms for standard distributions P. • Rejection sampling: Sample z uniformly from domain Z, but accept the sample only with probability ∝ P(z). • Importance sampling: E[f] = ∫ f(z) p(z) dz ≈ (1/L) Σ_{l=1}^L f(z^l) p(z^l)/q(z^l), where the z^l are sampled from q. Choose q easy to sample from and large where f(z^l) p(z^l) is large. • Others: slice sampling.
  • 48. How to Attack Large Problems - 48 - Marcus Hutter Markov Chain Monte Carlo (MCMC) Sampling Metropolis: Choose some convenient q with q(z|z′) = q(z′|z). Sample z^{l+1} from q(·|z^l) but accept it only with probability min{1, p(z^{l+1})/p(z^l)}. Gibbs Sampling: Metropolis with q leaving z unchanged from l → l+1, except that coordinate i is resampled from P(z_i | z_{\i}) (z with coordinate i removed). Green lines are accepted and red lines are rejected Metropolis steps.
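A compact sketch of the Metropolis sampler with a symmetric Gaussian proposal; the target density, step size, and chain length are arbitrary illustrative choices.

    import numpy as np

    def metropolis(p, z0, steps=10000, step_size=1.0, seed=0):
        # Metropolis sampling: propose z' ~ q(.|z) (symmetric), accept with prob min{1, p(z')/p(z)}
        rng = np.random.default_rng(seed)
        z, samples = z0, []
        for _ in range(steps):
            z_prop = z + step_size * rng.normal()
            if rng.uniform() < min(1.0, p(z_prop) / p(z)):
                z = z_prop                     # accept the proposal
            samples.append(z)                  # on rejection the old z is repeated
        return np.array(samples)

    # unnormalized target: mixture of two Gaussians
    p = lambda z: np.exp(-0.5 * (z - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (z + 2.0) ** 2)
    s = metropolis(p, z0=0.0)
    print(s.mean(), s.std())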
  • 49. How to Attack Large Problems - 49 - Marcus Hutter Combining Models Performance can often be improved by combining multiple models in some way, instead of just using a single model in isolation. • Committees: Average the predictions of a set of individual models. • Bayesian model averaging: P(x) = Σ_Models P(x|Model) P(Model) • Mixture models: P(x|θ, π) = Σ_k π_k P_k(x|θ_k) • Decision tree: Each model is responsible for a different part of the input space. • Boosting: Train many weak classifiers in sequence and combine output to produce a powerful committee.
  • 50. How to Attack Large Problems - 50 - Marcus Hutter Boosting Idea: Train many weak classifiers G_m in sequence and combine their output to produce a powerful committee G. AdaBoost.M1: [Freund & Schapire (1997) received the famous Gödel Prize] Initialize observation weights w_i uniformly. For m = 1 to M do: (a) G_m classifies x as G_m(x) ∈ {−1, 1}. Train G_m weighing data point i with w_i. (b) Give G_m a high/low weight α_m if it performed well/badly. (c) Increase the attention = weight w_i for observations x_i misclassified by G_m. Output the weighted majority vote: G(x) = sign(Σ_{m=1}^M α_m G_m(x))
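A self-contained sketch of AdaBoost.M1 with decision stumps as the weak classifiers G_m on a 1D toy problem; the stump learner, the number of rounds M, and the data are illustrative choices, not part of the slides.

    import numpy as np

    def stump_fit(x, y, w):
        # weighted-error-minimizing decision stump: predict sign if x > threshold, else -sign
        best = (np.inf, 0.0, 1)
        for thr in np.unique(x):
            for sign in (+1, -1):
                pred = np.where(x > thr, sign, -sign)
                err = np.sum(w * (pred != y))
                if err < best[0]:
                    best = (err, thr, sign)
        return best[1], best[2]

    def adaboost(x, y, M=20):
        n = len(x); w = np.ones(n) / n                    # uniform observation weights
        stumps, alphas = [], []
        for m in range(M):
            thr, sign = stump_fit(x, y, w)
            pred = np.where(x > thr, sign, -sign)
            err = np.sum(w * (pred != y)) / np.sum(w)
            alpha = np.log((1 - err) / max(err, 1e-12))   # high weight if the stump performed well
            w *= np.exp(alpha * (pred != y))              # increase weight of misclassified points
            stumps.append((thr, sign)); alphas.append(alpha)
        def G(x_new):                                     # weighted majority vote
            votes = sum(a * np.where(x_new > t, s, -s) for a, (t, s) in zip(alphas, stumps))
            return np.sign(votes)
        return G

    rng = np.random.default_rng(7)
    x = rng.uniform(-3, 3, 200)
    y = np.where(np.abs(x) < 1, 1, -1)    # not separable by a single stump
    G = adaboost(x, y)
    print("training accuracy:", np.mean(G(x) == y))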
  • 51. Unsupervised Learning - 51 - Marcus Hutter 6 UNSUPERVISED LEARNING • K-Means Clustering • Mixture Models • Expectation Maximization Algorithm
  • 52. Unsupervised Learning - 52 - Marcus Hutter Unsupervised Learning Supervised learning: Find functional relationship f : X → Y from I/O data {(x_i, y_i)}. Unsupervised learning: Find pattern in data (x_1, ..., x_n) without an explicit teacher (no y values). Example: Clustering, e.g. by K-means. Implicit goal: Find a simple explanation, i.e. compress the data (MDL, Occam's razor). Density estimation: From which probability distribution P are the x_i drawn?
  • 53. Unsupervised Learning - 53 - Marcus Hutter K-means Clustering and EM • Data points seem to cluster around two centers. • Assign each data point i to a cluster k_i. • Let µ_k = center of cluster k. • Distortion measure: total squared distance of data points from their cluster centers: J(k, µ) := Σ_{i=1}^n ||x_i − µ_{k_i}||² • Choose centers µ_k initially at random. • E-step: Minimize J w.r.t. k: assign each point to the closest center. • M-step: Minimize J w.r.t. µ: let µ_k be the mean of the points belonging to cluster k. • Iterating the E-step and M-step converges to a local minimum of J.
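A sketch of the K-means iteration described above, alternating the assignment step and the mean-update step on synthetic 2D data; K, the iteration cap, and the data are arbitrary.

    import numpy as np

    def kmeans(X, K, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(len(X), K, replace=False)]     # choose centers initially at random
        for _ in range(iters):
            d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
            k = d2.argmin(axis=1)                        # assign each point to its closest center
            new_mu = np.array([X[k == j].mean(axis=0) if np.any(k == j) else mu[j]
                               for j in range(K)])       # move each center to its cluster mean
            if np.allclose(new_mu, mu):
                break
            mu = new_mu
        k = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)
        J = np.sum((X - mu[k]) ** 2)                     # distortion J(k, mu)
        return k, mu, J

    rng = np.random.default_rng(8)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
    k, mu, J = kmeans(X, K=2)
    print(mu, J)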
  • 54. Unsupervised Learning - 54 - Marcus Hutter Iterations of K-means EM Algorithm
  • 55. Unsupervised Learning - 55 - Marcus Hutter Mixture Models and EM Mixture of Gaussians: P(x|π, µ, Σ) = Σ_{k=1}^K π_k Gauss(x|µ_k, Σ_k). Maximize the likelihood P(x|...) w.r.t. π, µ, Σ. E-Step: Compute the probability γ_ik that data point i belongs to cluster k, based on the current estimates π, µ, Σ. M-Step: Re-estimate π, µ, Σ (take the empirical mean/variance) given the γ_ik.
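A sketch of EM for a one-dimensional mixture of two Gaussians, alternating the responsibility (E) step and the parameter re-estimation (M) step; the initialization and toy data are arbitrary.

    import numpy as np

    def gauss(x, mu, var):
        return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

    def em_gmm_1d(x, K=2, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        pi = np.ones(K) / K
        mu = rng.choice(x, K, replace=False)
        var = np.full(K, x.var())
        for _ in range(iters):
            # E-step: responsibilities gamma_ik = P(cluster k | x_i) under current parameters
            g = pi * gauss(x[:, None], mu, var)
            g /= g.sum(axis=1, keepdims=True)
            # M-step: re-estimate pi, mu, var as weighted empirical proportions/means/variances
            Nk = g.sum(axis=0)
            pi = Nk / len(x)
            mu = (g * x[:, None]).sum(axis=0) / Nk
            var = (g * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        return pi, mu, var

    rng = np.random.default_rng(9)
    x = np.concatenate([rng.normal(-2, 0.5, 150), rng.normal(3, 1.0, 150)])
    print(em_gmm_1d(x))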
  • 56. Non-IID: Sequential & (Re)Active Settings - 56 - Marcus Hutter 7 NON-IID: SEQUENTIAL & (RE)ACTIVE SETTINGS • Sequential Data • Sequential Decision Theory • Learning Agents • Reinforcement Learning (RL) • Learning in Games
  • 57. Non-IID: Sequential & (Re)Active Settings - 57 - Marcus Hutter Sequential Data General stochastic time-series: P(x_1...x_n) = Π_{i=1}^n P(x_i|x_1...x_{i−1}). Independent identically distributed (roulette, dice, classification): P(x_1...x_n) = P(x_1) P(x_2) ... P(x_n). First order Markov chain (Backgammon): P(x_1...x_n) = P(x_1) P(x_2|x_1) P(x_3|x_2) ... P(x_n|x_{n−1}). Second order Markov chain (mechanical systems): P(x_1...x_n) = P(x_1) P(x_2|x_1) P(x_3|x_1x_2) × ... × P(x_n|x_{n−1}x_{n−2}). Hidden Markov model (speech recognition): P(x_1...x_n) = ∫ P(z_1) P(z_2|z_1) ... P(z_n|z_{n−1}) × Π_{i=1}^n P(x_i|z_i) dz.
  • 58. Non-IID: Sequential & (Re)Active Settings - 58 - Marcus Hutter Sequential Decision Theory
Setup: For t = 1, 2, 3, 4, ...: given the sequence x_1, x_2, ..., x_{t−1}: (1) predict/make decision y_t, (2) observe x_t, (3) suffer loss Loss(x_t, y_t), (4) t → t + 1, goto (1).
Example: Weather Forecasting: x_t ∈ X = {sunny, rainy}, y_t ∈ Y = {umbrella, sunglasses}.
Loss(x_t, y_t): umbrella: sunny 0.1, rainy 0.3; sunglasses: sunny 0.0, rainy 1.0.
Goal: Minimize expected Loss: ŷ_t = arg min_{y_t} E[Loss(x_t, y_t)|x_1...x_{t−1}].
Greedy minimization of expected loss is optimal if (important): decision y_t does not influence the environment (future observations).
Examples: Loss = square / absolute / 0-1 error function ⇒ ŷ = mean / median / mode of P(x_t | · · · ).
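The weather example can be written out directly: compute the expected loss of each decision under the predictive probability of rain and pick the minimizer (a small sketch; the predictive probabilities passed in below are made up).

    # Loss table from the slide: Loss(x_t, y_t)
    loss = {("umbrella", "sunny"): 0.1, ("umbrella", "rainy"): 0.3,
            ("sunglasses", "sunny"): 0.0, ("sunglasses", "rainy"): 1.0}

    def best_decision(p_rainy):
        # arg min_y E[Loss(x_t, y) | x_1..x_{t-1}] given the predictive probability of rain
        expected = {y: (1 - p_rainy) * loss[(y, "sunny")] + p_rainy * loss[(y, "rainy")]
                    for y in ("umbrella", "sunglasses")}
        return min(expected, key=expected.get), expected

    print(best_decision(0.1))   # low chance of rain -> sunglasses (0.1 < 0.12)
    print(best_decision(0.5))   # higher chance of rain -> umbrella (0.2 < 0.5)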
  • 59. Non-IID: Sequential & (Re)Active Settings - 59 - Marcus Hutter Learning Agents 1 [Diagram: the agent perceives the environment through sensors (percepts) and acts on it through actuators (actions).] Additional complication: Learner can influence environment, and hence what he observes next. ⇒ farsightedness, planning, exploration necessary. Exploration ⇔ Exploitation problem.
  • 60. Non-IID: Sequential & (Re)Active Settings - 60 - Marcus Hutter Learning Agents 2 [Diagram of a learning agent: a performance standard feeds the critic, which gives feedback on environment changes to the learning element; the learning element exchanges knowledge with the performance element and sets learning goals for a problem generator; the agent senses the environment through sensors and acts through actuators.]
  • 61. Non-IID: Sequential & (Re)Active Settings - 61 - Marcus Hutter Reinforcement Learning (RL) [Diagram: the agent receives rewards and observations r_1|o_1, r_2|o_2, r_3|o_3, ... from the environment and sends back actions y_1, y_2, y_3, ...] • RL is concerned with how an agent ought to take actions in an environment so as to maximize its long-term reward. • Find a policy that maps states of the world to the actions the agent ought to take in those states. • The environment is typically formulated as a finite-state Markov decision process (MDP).
  • 62. Non-IID: Sequential & (Re)Active Settings - 62 - Marcus Hutter Learning in Games • Learning through self-play. • Backgammon (TD-Gammon). • Samuel's checker program.
  • 63. Summary - 63 - Marcus Hutter 8 SUMMARY • Important Loss Functions • Learning Algorithm Characteristics Comparison • More Learning • Data Sets • Journals • Annual Conferences • Recommended Literature
  • 64. Summary - 64 - Marcus Hutter Important Loss Functions
  • 65. Summary - 65 - Marcus Hutter Learning Algorithm Characteristics Comparison
  • 66. Summary - 66 - Marcus Hutter More Learning • Concept Learning • Bayesian Learning • Computational Learning Theory (PAC learning) • Genetic Algorithms • Learning Sets of Rules • Analytical Learning
  • 67. Summary - 67 - Marcus Hutter Data Sets • UCI Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html • UCI KDD Archive: http://kdd.ics.uci.edu/summary.data.application.html • Statlib: http://lib.stat.cmu.edu/ • Delve: http://www.cs.utoronto.ca/~delve/
  • 68. Summary - 68 - Marcus Hutter Journals • Journal of Machine Learning Research • Machine Learning • IEEE Transactions on Pattern Analysis and Machine Intelligence • Neural Computation • Neural Networks • IEEE Transactions on Neural Networks • Annals of Statistics • Journal of the American Statistical Association • ...
  • 69. Summary - 69 - Marcus Hutter Annual Conferences • Algorithmic Learning Theory (ALT) • Computational Learning Theory (COLT) • Uncertainty in Artificial Intelligence (UAI) • Neural Information Processing Systems (NIPS) • European Conference on Machine Learning (ECML) • International Conference on Machine Learning (ICML) • International Joint Conference on Artificial Intelligence (IJCAI) • International Conference on Artificial Neural Networks (ICANN)
  • 70. Summary - 70 - Marcus Hutter Recommended Literature [Bis06] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. [HTF01] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, 2001. [Alp04] E. Alpaydin. Introduction to Machine Learning. MIT Press, 2004. [Hut05] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005. http://www.hutter1.net/ai/uaibook.htm.