SlideShare ist ein Scribd-Unternehmen logo
1 von 90
Downloaden Sie, um offline zu lesen
Machine Learning
A Scientific Method or Just a Bag of Tools?
                        Don Hush


                  Machine Learning Team
        Group CCS-3, Los Alamos National Laboratory




                                           Los Alamos National Laboratory LAUR Number 06-2338 – p.1/30
Machine Learning Toolbox
  Fisher’s Linear Discriminant
  Nearest Neighbor
  Neural Networks (backprop)
  Decision Trees (CART, C4.5)
  Boosting
  Support Vector Machines
  K–Means Clustering
  Principle Component Analysis (PCA)
  Expectation-Maximization (EM)
  ... and many more

                                       Los Alamos National Laboratory LAUR Number 06-2338 – p.2/30
A Day at Work with the ML Toolbox

  Job Assignment: Design a system that uses the Tufts
  Artificial Nose to detect trichloroethylene (TCE).

  Tufts Data Collection:
     760 samples with TCE

     352 samples without TCE

  Tool: Support Vector Machine (SVM)




                                    Los Alamos National Laboratory LAUR Number 06-2338 – p.3/30
A Day at Work ...
   The SVM Tool:
     produces a classifier and
     reports a classification error rate of 18%




                                      Los Alamos National Laboratory LAUR Number 06-2338 – p.4/30
A Day at Work ...
   The SVM Tool:
     produces a classifier and
     reports a classification error rate of 18%
     However, when the classifier is deployed it
     produces an error rate of 38%




                                     Los Alamos National Laboratory LAUR Number 06-2338 – p.4/30
A Day at Work ...
   The SVM Tool:
     produces a classifier and
     reports a classification error rate of 18%
     However, when the classifier is deployed it
     produces an error rate of 38%
   One of the (many) reasons why this is unacceptable:
   The naive classifier, that predicts NOT-TCE for every
   sample, produces an error rate of 10%.




                                      Los Alamos National Laboratory LAUR Number 06-2338 – p.4/30
A Day at Work ...
   The SVM Tool:
     produces a classifier and
     reports a classification error rate of 18%
     However, when the classifier is deployed it
     produces an error rate of 38%
   One of the (many) reasons why this is unacceptable:
   The naive classifier, that predicts NOT-TCE for every
   sample, produces an error rate of 10%.
   What Went Wrong?



                                      Los Alamos National Laboratory LAUR Number 06-2338 – p.4/30
A Day at Work ...
   The SVM Tool:
     produces a classifier and
     reports a classification error rate of 18%
     However, when the classifier is deployed it
     produces an error rate of 38%
   One of the (many) reasons why this is unacceptable:
   The naive classifier, that predicts NOT-TCE for every
   sample, produces an error rate of 10%.
   What Went Wrong?
               READ THE MANUAL

                                      Los Alamos National Laboratory LAUR Number 06-2338 – p.4/30
The SVM Manual Entry
  SVMs ... assume that the operating environment is
  characterized by a stationary random process and that
  the training data is sampled from that process ...




                                    Los Alamos National Laboratory LAUR Number 06-2338 – p.5/30
The SVM Manual Entry
  SVMs ... assume that the operating environment is
  characterized by a stationary random process and that
  the training data is sampled from that process ...


   Since the fraction of TCE samples in the training
   data is 0.7 and the fraction on the operating environ-
   ment is 0.1, the TCE problem violates the assump-
   tions!




                                       Los Alamos National Laboratory LAUR Number 06-2338 – p.5/30
What Should We Do?


  tweak the SVM tool




                       Los Alamos National Laboratory LAUR Number 06-2338 – p.6/30
What Should We Do?


  tweak the SVM tool

  use a different tool from the toolbox




                                 Los Alamos National Laboratory LAUR Number 06-2338 – p.6/30
What Should We Do?


  tweak the SVM tool

  use a different tool from the toolbox

  design a tool specifically for the TCE problem




                                 Los Alamos National Laboratory LAUR Number 06-2338 – p.6/30
How do we design a new tool?




                   Los Alamos National Laboratory LAUR Number 06-2338 – p.7/30
Scientific Problem Formulation
Key Ingredients:
   Specify a performance criterion – a measure of the
   quality of the model




                                      Los Alamos National Laboratory LAUR Number 06-2338 – p.8/30
Scientific Problem Formulation
Key Ingredients:
   Specify a performance criterion – a measure of the
   quality of the model
   Identify and characterize the information available to
   design the model




                                        Los Alamos National Laboratory LAUR Number 06-2338 – p.8/30
Scientific Problem Formulation
Key Ingredients:
   Specify a performance criterion – a measure of the
   quality of the model
   Identify and characterize the information available to
   design the model ... two types
      Empirical (EMP), i.e. data
      First Principles Knowledge (FP)




                                        Los Alamos National Laboratory LAUR Number 06-2338 – p.8/30
Scientific Problem Formulation
Key Ingredients:
   Specify a performance criterion – a measure of the
   quality of the model
   Identify and characterize the information available to
   design the model
   Establish a validation procedure – a way to evaluate
   (or estimate) the performance of a proposed model




                                        Los Alamos National Laboratory LAUR Number 06-2338 – p.8/30
Scientific Problem Formulation
Key Ingredients:
   Specify a performance criterion – a measure of the
   quality of the model
   Identify and characterize the information available to
   design the model
   Establish a validation procedure – a way to evaluate
   (or estimate) the performance of a proposed model
   ... two methods
        Empirical Tests
        Theoretical Analysis

                                        Los Alamos National Laboratory LAUR Number 06-2338 – p.8/30
Scientific Problem Formulation
Key Ingredients:
    Specify a performance criterion – a measure of the
    quality of the model
    Identify and characterize the information available to
    design the model
    Establish a validation procedure – a way to evaluate
    (or estimate) the performance of a proposed model


All of this is done before we develop a solution method.



                                         Los Alamos National Laboratory LAUR Number 06-2338 – p.8/30
ML + Scientific Method
       



               ¡
The            Approach:
          £¢

1. Construct a scientific problem formulation.




                                      Los Alamos National Laboratory LAUR Number 06-2338 – p.9/30
ML + Scientific Method
       



               ¡
The            Approach:
          £¢

1. Construct a scientific problem formulation.
2. Determine its feasibility. If not feasible, go to step 1.




                                           Los Alamos National Laboratory LAUR Number 06-2338 – p.9/30
ML + Scientific Method
       



               ¡
The            Approach:
          £¢

1. Construct a scientific problem formulation.
2. Determine its feasibility. If not feasible, go to step 1.
3. If feasible then use any means necessary to determine
   a solution method that is guaranteed to be
          practical (e.g. computationally feasible), and
          provide good performance (e.g. near optimal)




                                            Los Alamos National Laboratory LAUR Number 06-2338 – p.9/30
ML + Scientific Method
        



                ¡
The             Approach:
           £¢

1. Construct a scientific problem formulation.
2. Determine its feasibility. If not feasible, go to step 1.
3. If feasible then use any means necessary to determine
   a solution method that is guaranteed to be
           practical (e.g. computationally feasible), and
           provide good performance (e.g. near optimal)
      (although obvious, very few tools are designed to
      provide such guarantees!)

                                             Los Alamos National Laboratory LAUR Number 06-2338 – p.9/30
An Example: Applying        to the




                                          
Supervised Classification Problem




                      Los Alamos National Laboratory LAUR Number 06-2338 – p.10/30
+ Supervised Classification
  
A model       assigns the label                           to data point .




                                          

                                                

                                                     
          ¤




                                                 ¤
                                     ¨¦
                                  ¥
                                  §©




                                              




                                                                                          
                                                 Los Alamos National Laboratory LAUR Number 06-2338 – p.11/30
+ Supervised Classification
  
A model assigns the label              to data point .




                                     

                                           

                                                
         ¤




                                            ¤
                                ¨¦
                            ¥
                             §©




                                         




                                                                                     
Performance Criterion: classification error rate,




                                                                             

                                                                                   
                                                                                 ¤
                                                                           
                                            Los Alamos National Laboratory LAUR Number 06-2338 – p.11/30
+ Supervised Classification
  
A model assigns the label                  to data point .




                                       

                                             

                                                  
         ¤




                                              ¤
                                  ¨¦
                              ¥
                               §©




                                           




                                                                                       
Performance Criterion: classification error rate,




                                                                               

                                                                                     
                                                                                   ¤
                                                                             
Information:
    First Principles: the operating environment is
    characterized by a stationary random process
    Empirical: (labeled) training data is sampled from
    that process




                                              Los Alamos National Laboratory LAUR Number 06-2338 – p.11/30
+ Supervised Classification
  
A model assigns the label                  to data point .




                                       

                                             

                                                  
         ¤




                                              ¤
                                  ¨¦
                              ¥
                               §©




                                           




                                                                                       
Performance Criterion: classification error rate,




                                                                               

                                                                                     
                                                                                   ¤
                                                                             
Information:
    First Principles: the operating environment is
    characterized by a stationary random process
    Empirical: (labeled) training data is sampled from
    that process
Validation:
    Empirical: hold-out, cross-validation, bootstrap



                                              Los Alamos National Laboratory LAUR Number 06-2338 – p.11/30
+ Supervised Classification
  
A model assigns the label                  to data point .




                                       

                                             

                                                  
         ¤




                                              ¤
                                  ¨¦
                              ¥
                               §©




                                           




                                                                                       
Performance Criterion: classification error rate,




                                                                               

                                                                                     
                                                                                   ¤
                                                                             
Information:
    First Principles: the operating environment is
    characterized by a stationary random process
    Empirical: (labeled) training data is sampled from
    that process
Validation:
    Empirical: hold-out, cross-validation, bootstrap
    allows us to compare methods, but does not tell us
    how close we are to optimal
                                              Los Alamos National Laboratory LAUR Number 06-2338 – p.11/30
+ Supervised Classification
  
A model assigns the label                  to data point .




                                       

                                             

                                                  
         ¤




                                              ¤
                                  ¨¦
                              ¥
                               §©




                                           




                                                                                       
Performance Criterion: classification error rate,




                                                                               

                                                                                     
                                                                                   ¤
                                                                             
Information:
    First Principles: the operating environment is
    characterized by a stationary random process
    Empirical: (labeled) training data is sampled from
    that process
Validation:
    Empirical: hold-out, cross-validation, bootstrap
    Theoretical: Statistics + Computer Science


                                              Los Alamos National Laboratory LAUR Number 06-2338 – p.11/30
+ Supervised Classification
  
A model assigns the label                  to data point .




                                       

                                             

                                                  
         ¤




                                              ¤
                                  ¨¦
                              ¥
                               §©




                                           




                                                                                       
Performance Criterion: classification error rate,




                                                                               

                                                                                     
                                                                                   ¤
                                                                             
Information:
    First Principles: the operating environment is
    characterized by a stationary random process
    Empirical: (labeled) training data is sampled from
    that process
Validation:
    Empirical: hold-out, cross-validation, bootstrap
    Theoretical: Statistics + Computer Science
    probably approximately correct (PAC) analysis
                                              Los Alamos National Laboratory LAUR Number 06-2338 – p.11/30
Support Vector Machines (SVMs)
PAC Result: With mild assumptions on the distribution the SVM with




                                                                                                                  
training samples requires




                                      $
                                




                                                  
                                

                                     # !

                                              %
                                    




                                                  
computation to produce a classifier                with performance




                                           ( )'
                      

                          12




                                                        9



                                                                    
                            '




                                                               B@
                                                  7
                                34




                                                  6
                      0




                                0



                                           5
                                              


                                                      8

                                                             A
                          (
where is the theoretical minimum error and the rate
       3




                                                                                     EC




                                                                                                     G
                                                                                   D


                                                                                            FD
      0




depends on the distribution.




                                                                    Los Alamos National Laboratory LAUR Number 06-2338 – p.12/30
Support Vector Machines (SVMs)
PAC Result: With mild assumptions on the distribution the SVM with




                                                                                                                   
training samples requires




                                       $
                                 




                                                   
                                 

                                      # !

                                               %
                                     




                                                   
computation to produce a classifier                 with performance




                                            ( )'
                        

                           12




                                                         9



                                                                     
                             '




                                                                B@
                                                   7
                                 34




                                                   6
                       0




                                 0



                                            5
                                               


                                                       8

                                                              A
                           (
where is the theoretical minimum error and the rate
       3




                                                                                      EC




                                                                                                      G
                                                                                    D


                                                                                             FD
      0




depends on the distribution.
Observation: This result addresses the major practical concerns:
    performance (of the actual classifier produced)
    computation (of the actual algorithm used)
    generality (applies to very large class of distributions)


                                                                     Los Alamos National Laboratory LAUR Number 06-2338 – p.12/30
Applying       to the TCE Problem

            




                      Los Alamos National Laboratory LAUR Number 06-2338 – p.13/30
+ TCE Problem
      
Scientific Problem Formulation: Same as supervised
classification except that




                                    Los Alamos National Laboratory LAUR Number 06-2338 – p.14/30
+ TCE Problem
      
Scientific Problem Formulation: Same as supervised
classification except that

   we assume a nonstationary random process because
   the fraction of time that TCE is present varies over the
   range       .
                   Q 
          I P H




                                       Los Alamos National Laboratory LAUR Number 06-2338 – p.14/30
+ TCE Problem
      
Scientific Problem Formulation: Same as supervised
classification except that

   we assume a nonstationary random process because
   the fraction of time that TCE is present varies over the
   range       .
                   Q 
          I P H




   the performance criterion is the error rate for the worst
   possible value in the range      .

                                        Q 
                               I P H
   (this is a min–max problem)



                                              Los Alamos National Laboratory LAUR Number 06-2338 – p.14/30
TCE Tool




           Los Alamos National Laboratory LAUR Number 06-2338 – p.15/30
Impacts of       on Data Driven


                 
             Modeling




                      Los Alamos National Laboratory LAUR Number 06-2338 – p.16/30
Impacts of                     on DDM




                            
   It has focused attention on Direct Solution Methods




                                             Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
Impacts of                    on DDM




                           
   It has focused attention on Direct Solution Methods
   Direct Method: choose a model that minimizes an empirical risk
   function that is calibrated with respect to the performance
   criterion.




                                            Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
Impacts of                   on DDM




                          
   It has focused attention on Direct Solution Methods
   Paradigm Shift?: replace maximum likelihood, maximum entropy,
   least squares, plug–in, etc. with calibrated empirical risk
   minimization




                                           Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
Impacts of                    on DDM




                           
   It has focused attention on Direct Solution Methods
   It has started a movement towards end–to–end learning




                                           Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
Impacts of                    on DDM




                           
   It has focused attention on Direct Solution Methods
   It has started a movement towards end–to–end learning
   Traditional Approach:




                                           Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
Impacts of                    on DDM




                           
   It has focused attention on Direct Solution Methods
   It has started a movement towards end–to–end learning
   Traditional Approach:




   New Approach:




                                           Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
Impacts of                     on DDM




                            
   It has focused attention on Direct Solution Methods
   It has started a movement towards end–to–end learning
   Traditional Approach:




   New Approach:
      push learning into earlier stages




                                           Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
Impacts of                     on DDM




                            
   It has focused attention on Direct Solution Methods
   It has started a movement towards end–to–end learning
   Traditional Approach:




   New Approach:
      push learning into earlier stages
      collapse last two stages into one using model classes that are




                                             Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
Impacts of                     on DDM




                            
   It has focused attention on Direct Solution Methods
   It has started a movement towards end–to–end learning
   Traditional Approach:




   New Approach:
      push learning into earlier stages
      collapse last two stages into one using model classes that are
         richer (e.g. higher dimensions)




                                             Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
Impacts of                     on DDM




                            
   It has focused attention on Direct Solution Methods
   It has started a movement towards end–to–end learning
   Traditional Approach:




   New Approach:
      push learning into earlier stages
      collapse last two stages into one using model classes that are
         richer (e.g. higher dimensions)
         Myth: dimensionality must be reduced to achieve good
         performance. example


                                             Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
Impacts of                     on DDM




                            
   It has focused attention on Direct Solution Methods
   It has started a movement towards end–to–end learning
   Traditional Approach:




   New Approach:
      push learning into earlier stages
      collapse last two stages into one using model classes that are
         richer (e.g. higher dimensions)
         more flexible (e.g. accommodates different data types)




                                             Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
Impacts of                     on DDM




                            
   It has focused attention on Direct Solution Methods
   It has started a movement towards end–to–end learning
   Traditional Approach:




   New Approach:
      push learning into earlier stages
      collapse last two stages into one using model classes that are
         richer (e.g. higher dimensions)
         more flexible (e.g. accommodates different data types)
         Myth: data must be mapped to       before we can build a
                                         S
                                        R
         model. example

                                             Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
Impacts of                     on DDM




                            
   It has focused attention on Direct Solution Methods
   It has started a movement towards end–to–end learning
   Traditional Approach:




   New Approach:
      push learning into earlier stages
      collapse last two stages into one using model classes that are
         richer (e.g. higher dimensions)
         more flexible (e.g. accommodates different data types)
         simply parameterized


                                             Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
Impacts of                     on DDM




                            
   It has focused attention on Direct Solution Methods
   It has started a movement towards end–to–end learning
   Traditional Approach:




   New Approach:
      push learning into earlier stages
      collapse last two stages into one using model classes that are
         richer (e.g. higher dimensions)
         more flexible (e.g. accommodates different data types)
         simply parameterized
      Example: Kernel Machines

                                             Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
Impacts of                     on DDM




                            
   It has focused attention on Direct Solution Methods
   It has started a movement towards end–to–end learning
   Traditional Approach:




   New Approach:
      push learning into earlier stages
      collapse last two stages into one using model classes that are
         richer (e.g. higher dimensions)
         more flexible (e.g. accommodates different data types)
         simply parameterized
   Paradigm Shift?: replace feature design with kernel design

                                             Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
Applying      to Anomaly


             
        Detection




                 Los Alamos National Laboratory LAUR Number 06-2338 – p.18/30
+ Anomaly Detection
 




                     Los Alamos National Laboratory LAUR Number 06-2338 – p.19/30
+ Anomaly Detection
  
Definition of Anomaly: is anomalous if its density values




                      T
falls below a threshold, i.e.   .




                           V
                               X W
                          U
                              T



                                      Y




                                          Los Alamos National Laboratory LAUR Number 06-2338 – p.19/30
+ Anomaly Detection
  
Definition of Anomaly: is anomalous if its density values




                          T
falls below a threshold, i.e.   .




                              V
                                   X W
                              U
                                  T



                                              Y
                                          p



                                                          ρ

                                  p ρ                                x
              anomalous       normal               anomalous




                                                  Los Alamos National Laboratory LAUR Number 06-2338 – p.19/30
+ Anomaly Detection
  
Definition of Anomaly: is anomalous if its density values




                          T
falls below a threshold, i.e.   .




                              V
                                   X W
                              U
                                  T



                                              Y
                                          p


                                                  f
                                                              ρ

                                  p ρ                                    x
                                  f0
                                   ∆


A detector       predicts an anomaly when                                     .



                                                          V
                                                                 bW
             `




                                                        `




                                                                         c
                                                               a
                                                             T
                                                      Los Alamos National Laboratory LAUR Number 06-2338 – p.19/30
+ Anomaly Detection
  
Definition of Anomaly: is anomalous if its density values




                          T
falls below a threshold, i.e.   .




                              V
                                   X W
                              U
                                  T



                                              Y
                                          p


                                                  f
                                                              ρ

                                  p ρ                                    x
                                  f0
                                   ∆


A detector       predicts an anomaly when                                     .



                                                          V
                                                                 bW
             `




                                                        `




                                                                         c
                                                               a
                                                             T
Criterion Function: the labeling error rate,


                                                               V
                                                                  efW




                                                                                  hV

                                                                                        W
                                                                    `




                                                                                  g
                                                             d
                                                      Los Alamos National Laboratory LAUR Number 06-2338 – p.19/30
+ Anomaly Detection
  
Information:
     First Principles: the operating environment is
     characterized by a stationary random process
     Empirical: (unlabeled) training data is sampled from that
     process




                                         Los Alamos National Laboratory LAUR Number 06-2338 – p.20/30
+ Anomaly Detection
  
Information:
     First Principles: the operating environment is
     characterized by a stationary random process
     Empirical: (unlabeled) training data is sampled from that
     process
Validation:
     Empirical: no reliable method for estimating                      !




                                                               V
                                                                    W
                                                                  `
                                                             d
                                         Los Alamos National Laboratory LAUR Number 06-2338 – p.20/30
+ Anomaly Detection
  
Information:
     First Principles: the operating environment is
     characterized by a stationary random process
     Empirical: (unlabeled) training data is sampled from that
     process
Validation:
     Empirical: no reliable method for estimating      !




                                                               V
                                                                    W
                                                                  `
                                                             d
     Theoretical: Substantial work on the accuracy of density
     estimation methods, but little work on their accuracy with
     respect to     !
                V
                    W
                   `
               d




                                         Los Alamos National Laboratory LAUR Number 06-2338 – p.20/30
+ AD = Recent Discovery
  
LANL discovered a function        that, with a mild assumption




                             di
on the distribution, is




                                           Los Alamos National Laboratory LAUR Number 06-2338 – p.21/30
+ AD = Recent Discovery
  
LANL discovered a function that, with a mild assumption




                          di
on the distribution, is
   calibrated with respect to , and




                           d




                                     Los Alamos National Laboratory LAUR Number 06-2338 – p.21/30
+ AD = Recent Discovery
  
LANL discovered a function that, with a mild assumption




                          di
on the distribution, is
   calibrated with respect to , and




                           d
   can be reliably estimated from sample data




                                     Los Alamos National Laboratory LAUR Number 06-2338 – p.21/30
+ AD = Recent Discovery
  
LANL discovered a function that, with a mild assumption




                          di
on the distribution, is
   calibrated with respect to , and




                           d
   can be reliably estimated from sample data

Consequences:




                                     Los Alamos National Laboratory LAUR Number 06-2338 – p.21/30
+ AD = Recent Discovery
  
LANL discovered a function that, with a mild assumption




                            di
on the distribution, is
   calibrated with respect to , and




                             d
   can be reliably estimated from sample data

Consequences:
  empirical validation is now possible!




                                      Los Alamos National Laboratory LAUR Number 06-2338 – p.21/30
+ AD = Recent Discovery
  
LANL discovered a function that, with a mild assumption




                          di
on the distribution, is
   calibrated with respect to , and




                           d
   can be reliably estimated from sample data

Consequences:
  empirical validation is now possible!
  direct solution methods can now be developed for AD




                                     Los Alamos National Laboratory LAUR Number 06-2338 – p.21/30
+ AD = Recent Discovery
  
LANL discovered a function that, with a mild assumption




                           di
on the distribution, is
   calibrated with respect to , and




                            d
   can be reliably estimated from sample data

Consequences:
  empirical validation is now possible!
  direct solution methods can now be developed for AD
  LANL has developed a direct solution method with
  properties similar to SVMs for supervised classification




                                      Los Alamos National Laboratory LAUR Number 06-2338 – p.21/30
Philosophy
 




           Los Alamos National Laboratory LAUR Number 06-2338 – p.22/30
Philosophy
  
Without a performance criterion model design is trivial.




                                    Los Alamos National Laboratory LAUR Number 06-2338 – p.23/30
Philosophy
  
Without a performance criterion model design is trivial.

Validation is the cornerstone of the scientific method.




                                    Los Alamos National Laboratory LAUR Number 06-2338 – p.23/30
Philosophy
  
Without a performance criterion model design is trivial.

Validation is the cornerstone of the scientific method.

Empirical and theoretical validation have different
strengths and weaknesses, and having both provides a
complete picture.




                                    Los Alamos National Laboratory LAUR Number 06-2338 – p.23/30
Philosophy
  
Without a performance criterion model design is trivial.

Validation is the cornerstone of the scientific method.

Empirical and theoretical validation have different
strengths and weaknesses, and having both provides a
complete picture.

FP and EMP information are both critical for success.
Myth: All you need is data.


                                    Los Alamos National Laboratory LAUR Number 06-2338 – p.23/30
Philosophy
   



Why there will never be a Nobel Prize in




                                                                         q
                                                                      p
                               Los Alamos National Laboratory LAUR Number 06-2338 – p.24/30
Philosophy
   



Why there will never be a Nobel Prize in




                                                                              q
                                                                           p
 Performance matters, but a first principles
 interpretation of the model does not.




                                    Los Alamos National Laboratory LAUR Number 06-2338 – p.24/30
Philosophy
   



Why there will never be a Nobel Prize in




                                                                              q
                                                                           p
 Performance matters, but a first principles
 interpretation of the model does not.


 Myth: A FP model is necessary to achieve good
 performance. example.




                                    Los Alamos National Laboratory LAUR Number 06-2338 – p.24/30
Challenges of Modern Data
Biggest challenge is not large amounts of data, but rather
    the lack of relevant information




                                        Los Alamos National Laboratory LAUR Number 06-2338 – p.25/30
Challenges of Modern Data
Biggest challenge is not large amounts of data, but rather
    the lack of relevant information
     large amount of data          large amount of information




                            s br




                                           Los Alamos National Laboratory LAUR Number 06-2338 – p.25/30
Challenges of Modern Data
Biggest challenge is not large amounts of data, but rather
    the lack of relevant information
     large amount of data          large amount of information




                            s br
    the mis–match between the natural structure of the
    data and the objects of interest




                                           Los Alamos National Laboratory LAUR Number 06-2338 – p.25/30
Challenges of Modern Data
Biggest challenge is not large amounts of data, but rather
    the lack of relevant information
     large amount of data          large amount of information




                            s br
    the mis–match between the natural structure of the
    data and the objects of interest

    the (increasing) gap between existing tools and the
    problems we want to solve




                                           Los Alamos National Laboratory LAUR Number 06-2338 – p.25/30
Challenges of Modern Data
Biggest challenge is not large amounts of data, but rather
    the lack of relevant information
     large amount of data               large amount of information




                                 s br
    the mis–match between the natural structure of the
    data and the objects of interest

    the (increasing) gap between existing tools and the
    problems we want to solve

    and last but not least ...


                                                Los Alamos National Laboratory LAUR Number 06-2338 – p.25/30
Challenges of Modern Data

  The lack of a scientific problem formulations !




                                 Los Alamos National Laboratory LAUR Number 06-2338 – p.26/30
Challenges of Modern Data

  The lack of a scientific problem formulations !

   How well does Google work?
   What is a meaningful performance criterion?
   How can it be validated?




                                     Los Alamos National Laboratory LAUR Number 06-2338 – p.26/30
THE END

Thank You!




             Los Alamos National Laboratory LAUR Number 06-2338 – p.27/30
SVMs and Curse of Dimensionality

       DARPA Intrusion Detection Data

         Dimension Error Rate (%)
                  27                 0.47
                                     0.18
                        x
             vt


                   wH
           u




                                     0.14
                        €
           uy


                   wH




                            return



                                            Los Alamos National Laboratory LAUR Number 06-2338 – p.28/30
Anomaly Detection on Graphs

  Individual graphs represent the interaction between people in a
  text unit (e.g book, magazine, newspaper, report, or sections of
  these types of documents).
  People (vertices) are labeled by their rank (1 = most important).




          Normal Graph
                                             Anomalous Graph

                              return

                                             Los Alamos National Laboratory LAUR Number 06-2338 – p.29/30
Gaussian Benchmark Problem
  The data is Gaussian
  The Gaussian Maximum Likelihood (GML) method uses a
  first principles model and the SVM uses a universal model.




                          return
                                       Los Alamos National Laboratory LAUR Number 06-2338 – p.30/30

Weitere ähnliche Inhalte

Ähnlich wie Machine Learning (20)

test3
test3test3
test3
 
test2
test2test2
test2
 
provoora
provooraprovoora
provoora
 
prova3
prova3prova3
prova3
 
stasera1
stasera1stasera1
stasera1
 
provarealw2
provarealw2provarealw2
provarealw2
 
prova5
prova5prova5
prova5
 
provarealw3
provarealw3provarealw3
provarealw3
 
finalelocale
finalelocalefinalelocale
finalelocale
 
testsfw3
testsfw3testsfw3
testsfw3
 
 
provacompleta3
provacompleta3provacompleta3
provacompleta3
 
provacompleta4
provacompleta4provacompleta4
provacompleta4
 
prova7
prova7prova7
prova7
 
domenica1
domenica1domenica1
domenica1
 
testsfw5
testsfw5testsfw5
testsfw5
 
prova9
prova9prova9
prova9
 
prova2
prova2prova2
prova2
 
test
testtest
test
 
prova2
prova2prova2
prova2
 

Mehr von butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

Mehr von butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

Machine Learning

  • 1. Machine Learning A Scientific Method or Just a Bag of Tools? Don Hush Machine Learning Team Group CCS-3, Los Alamos National Laboratory Los Alamos National Laboratory LAUR Number 06-2338 – p.1/30
  • 2. Machine Learning Toolbox Fisher’s Linear Discriminant Nearest Neighbor Neural Networks (backprop) Decision Trees (CART, C4.5) Boosting Support Vector Machines K–Means Clustering Principle Component Analysis (PCA) Expectation-Maximization (EM) ... and many more Los Alamos National Laboratory LAUR Number 06-2338 – p.2/30
  • 3. A Day at Work with the ML Toolbox Job Assignment: Design a system that uses the Tufts Artificial Nose to detect trichloroethylene (TCE). Tufts Data Collection: 760 samples with TCE 352 samples without TCE Tool: Support Vector Machine (SVM) Los Alamos National Laboratory LAUR Number 06-2338 – p.3/30
  • 4. A Day at Work ... The SVM Tool: produces a classifier and reports a classification error rate of 18% Los Alamos National Laboratory LAUR Number 06-2338 – p.4/30
  • 5. A Day at Work ... The SVM Tool: produces a classifier and reports a classification error rate of 18% However, when the classifier is deployed it produces an error rate of 38% Los Alamos National Laboratory LAUR Number 06-2338 – p.4/30
  • 6. A Day at Work ... The SVM Tool: produces a classifier and reports a classification error rate of 18% However, when the classifier is deployed it produces an error rate of 38% One of the (many) reasons why this is unacceptable: The naive classifier, that predicts NOT-TCE for every sample, produces an error rate of 10%. Los Alamos National Laboratory LAUR Number 06-2338 – p.4/30
  • 7. A Day at Work ... The SVM Tool: produces a classifier and reports a classification error rate of 18% However, when the classifier is deployed it produces an error rate of 38% One of the (many) reasons why this is unacceptable: The naive classifier, that predicts NOT-TCE for every sample, produces an error rate of 10%. What Went Wrong? Los Alamos National Laboratory LAUR Number 06-2338 – p.4/30
  • 8. A Day at Work ... The SVM Tool: produces a classifier and reports a classification error rate of 18% However, when the classifier is deployed it produces an error rate of 38% One of the (many) reasons why this is unacceptable: The naive classifier, that predicts NOT-TCE for every sample, produces an error rate of 10%. What Went Wrong? READ THE MANUAL Los Alamos National Laboratory LAUR Number 06-2338 – p.4/30
  • 9. The SVM Manual Entry SVMs ... assume that the operating environment is characterized by a stationary random process and that the training data is sampled from that process ... Los Alamos National Laboratory LAUR Number 06-2338 – p.5/30
  • 10. The SVM Manual Entry SVMs ... assume that the operating environment is characterized by a stationary random process and that the training data is sampled from that process ... Since the fraction of TCE samples in the training data is 0.7 and the fraction on the operating environ- ment is 0.1, the TCE problem violates the assump- tions! Los Alamos National Laboratory LAUR Number 06-2338 – p.5/30
  • 11. What Should We Do? tweak the SVM tool Los Alamos National Laboratory LAUR Number 06-2338 – p.6/30
  • 12. What Should We Do? tweak the SVM tool use a different tool from the toolbox Los Alamos National Laboratory LAUR Number 06-2338 – p.6/30
  • 13. What Should We Do? tweak the SVM tool use a different tool from the toolbox design a tool specifically for the TCE problem Los Alamos National Laboratory LAUR Number 06-2338 – p.6/30
  • 14. How do we design a new tool? Los Alamos National Laboratory LAUR Number 06-2338 – p.7/30
  • 15. Scientific Problem Formulation Key Ingredients: Specify a performance criterion – a measure of the quality of the model Los Alamos National Laboratory LAUR Number 06-2338 – p.8/30
  • 16. Scientific Problem Formulation Key Ingredients: Specify a performance criterion – a measure of the quality of the model Identify and characterize the information available to design the model Los Alamos National Laboratory LAUR Number 06-2338 – p.8/30
  • 17. Scientific Problem Formulation Key Ingredients: Specify a performance criterion – a measure of the quality of the model Identify and characterize the information available to design the model ... two types Empirical (EMP), i.e. data First Principles Knowledge (FP) Los Alamos National Laboratory LAUR Number 06-2338 – p.8/30
  • 18. Scientific Problem Formulation Key Ingredients: Specify a performance criterion – a measure of the quality of the model Identify and characterize the information available to design the model Establish a validation procedure – a way to evaluate (or estimate) the performance of a proposed model Los Alamos National Laboratory LAUR Number 06-2338 – p.8/30
  • 19. Scientific Problem Formulation Key Ingredients: Specify a performance criterion – a measure of the quality of the model Identify and characterize the information available to design the model Establish a validation procedure – a way to evaluate (or estimate) the performance of a proposed model ... two methods Empirical Tests Theoretical Analysis Los Alamos National Laboratory LAUR Number 06-2338 – p.8/30
  • 20. Scientific Problem Formulation Key Ingredients: Specify a performance criterion – a measure of the quality of the model Identify and characterize the information available to design the model Establish a validation procedure – a way to evaluate (or estimate) the performance of a proposed model All of this is done before we develop a solution method. Los Alamos National Laboratory LAUR Number 06-2338 – p.8/30
  • 21. ML + Scientific Method   ¡ The Approach: £¢ 1. Construct a scientific problem formulation. Los Alamos National Laboratory LAUR Number 06-2338 – p.9/30
  • 22. ML + Scientific Method   ¡ The Approach: £¢ 1. Construct a scientific problem formulation. 2. Determine its feasibility. If not feasible, go to step 1. Los Alamos National Laboratory LAUR Number 06-2338 – p.9/30
  • 23. ML + Scientific Method   ¡ The Approach: £¢ 1. Construct a scientific problem formulation. 2. Determine its feasibility. If not feasible, go to step 1. 3. If feasible then use any means necessary to determine a solution method that is guaranteed to be practical (e.g. computationally feasible), and provide good performance (e.g. near optimal) Los Alamos National Laboratory LAUR Number 06-2338 – p.9/30
  • 24. ML + Scientific Method   ¡ The Approach: £¢ 1. Construct a scientific problem formulation. 2. Determine its feasibility. If not feasible, go to step 1. 3. If feasible then use any means necessary to determine a solution method that is guaranteed to be practical (e.g. computationally feasible), and provide good performance (e.g. near optimal) (although obvious, very few tools are designed to provide such guarantees!) Los Alamos National Laboratory LAUR Number 06-2338 – p.9/30
  • 25. An Example: Applying to the   Supervised Classification Problem Los Alamos National Laboratory LAUR Number 06-2338 – p.10/30
  • 26. + Supervised Classification   A model assigns the label to data point . ¤ ¤ ¨¦ ¥ §© Los Alamos National Laboratory LAUR Number 06-2338 – p.11/30
  • 27. + Supervised Classification   A model assigns the label to data point . ¤ ¤ ¨¦ ¥ §© Performance Criterion: classification error rate, ¤ Los Alamos National Laboratory LAUR Number 06-2338 – p.11/30
  • 28. + Supervised Classification   A model assigns the label to data point . ¤ ¤ ¨¦ ¥ §© Performance Criterion: classification error rate, ¤ Information: First Principles: the operating environment is characterized by a stationary random process Empirical: (labeled) training data is sampled from that process Los Alamos National Laboratory LAUR Number 06-2338 – p.11/30
  • 29. + Supervised Classification   A model assigns the label to data point . ¤ ¤ ¨¦ ¥ §© Performance Criterion: classification error rate, ¤ Information: First Principles: the operating environment is characterized by a stationary random process Empirical: (labeled) training data is sampled from that process Validation: Empirical: hold-out, cross-validation, bootstrap Los Alamos National Laboratory LAUR Number 06-2338 – p.11/30
  • 30. + Supervised Classification   A model assigns the label to data point . ¤ ¤ ¨¦ ¥ §© Performance Criterion: classification error rate, ¤ Information: First Principles: the operating environment is characterized by a stationary random process Empirical: (labeled) training data is sampled from that process Validation: Empirical: hold-out, cross-validation, bootstrap allows us to compare methods, but does not tell us how close we are to optimal Los Alamos National Laboratory LAUR Number 06-2338 – p.11/30
  • 31. + Supervised Classification   A model assigns the label to data point . ¤ ¤ ¨¦ ¥ §© Performance Criterion: classification error rate, ¤ Information: First Principles: the operating environment is characterized by a stationary random process Empirical: (labeled) training data is sampled from that process Validation: Empirical: hold-out, cross-validation, bootstrap Theoretical: Statistics + Computer Science Los Alamos National Laboratory LAUR Number 06-2338 – p.11/30
  • 32. + Supervised Classification   A model assigns the label to data point . ¤ ¤ ¨¦ ¥ §© Performance Criterion: classification error rate, ¤ Information: First Principles: the operating environment is characterized by a stationary random process Empirical: (labeled) training data is sampled from that process Validation: Empirical: hold-out, cross-validation, bootstrap Theoretical: Statistics + Computer Science probably approximately correct (PAC) analysis Los Alamos National Laboratory LAUR Number 06-2338 – p.11/30
  • 33. Support Vector Machines (SVMs) PAC Result: With mild assumptions on the distribution the SVM with training samples requires $ # ! % computation to produce a classifier with performance ( )' 12 9 ' B@ 7 34 6 0 0 5 8 A ( where is the theoretical minimum error and the rate 3 EC G D FD 0 depends on the distribution. Los Alamos National Laboratory LAUR Number 06-2338 – p.12/30
  • 34. Support Vector Machines (SVMs) PAC Result: With mild assumptions on the distribution the SVM with training samples requires $ # ! % computation to produce a classifier with performance ( )' 12 9 ' B@ 7 34 6 0 0 5 8 A ( where is the theoretical minimum error and the rate 3 EC G D FD 0 depends on the distribution. Observation: This result addresses the major practical concerns: performance (of the actual classifier produced) computation (of the actual algorithm used) generality (applies to very large class of distributions) Los Alamos National Laboratory LAUR Number 06-2338 – p.12/30
  • 35. Applying to the TCE Problem   Los Alamos National Laboratory LAUR Number 06-2338 – p.13/30
  • 36. + TCE Problem   Scientific Problem Formulation: Same as supervised classification except that Los Alamos National Laboratory LAUR Number 06-2338 – p.14/30
  • 37. + TCE Problem   Scientific Problem Formulation: Same as supervised classification except that we assume a nonstationary random process because the fraction of time that TCE is present varies over the range . Q I P H Los Alamos National Laboratory LAUR Number 06-2338 – p.14/30
  • 38. + TCE Problem   Scientific Problem Formulation: Same as supervised classification except that we assume a nonstationary random process because the fraction of time that TCE is present varies over the range . Q I P H the performance criterion is the error rate for the worst possible value in the range . Q I P H (this is a min–max problem) Los Alamos National Laboratory LAUR Number 06-2338 – p.14/30
  • 39. TCE Tool Los Alamos National Laboratory LAUR Number 06-2338 – p.15/30
  • 40. Impacts of on Data Driven   Modeling Los Alamos National Laboratory LAUR Number 06-2338 – p.16/30
  • 41. Impacts of on DDM   It has focused attention on Direct Solution Methods Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
  • 42. Impacts of on DDM   It has focused attention on Direct Solution Methods Direct Method: choose a model that minimizes an empirical risk function that is calibrated with respect to the performance criterion. Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
  • 43. Impacts of on DDM   It has focused attention on Direct Solution Methods Paradigm Shift?: replace maximum likelihood, maximum entropy, least squares, plug–in, etc. with calibrated empirical risk minimization Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
  • 44. Impacts of on DDM   It has focused attention on Direct Solution Methods It has started a movement towards end–to–end learning Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
  • 45. Impacts of on DDM   It has focused attention on Direct Solution Methods It has started a movement towards end–to–end learning Traditional Approach: Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
  • 46. Impacts of on DDM   It has focused attention on Direct Solution Methods It has started a movement towards end–to–end learning Traditional Approach: New Approach: Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
  • 47. Impacts of on DDM   It has focused attention on Direct Solution Methods It has started a movement towards end–to–end learning Traditional Approach: New Approach: push learning into earlier stages Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
  • 48. Impacts of on DDM   It has focused attention on Direct Solution Methods It has started a movement towards end–to–end learning Traditional Approach: New Approach: push learning into earlier stages collapse last two stages into one using model classes that are Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
  • 49. Impacts of on DDM   It has focused attention on Direct Solution Methods It has started a movement towards end–to–end learning Traditional Approach: New Approach: push learning into earlier stages collapse last two stages into one using model classes that are richer (e.g. higher dimensions) Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
  • 50. Impacts of on DDM   It has focused attention on Direct Solution Methods It has started a movement towards end–to–end learning Traditional Approach: New Approach: push learning into earlier stages collapse last two stages into one using model classes that are richer (e.g. higher dimensions) Myth: dimensionality must be reduced to achieve good performance. example Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
  • 51. Impacts of on DDM   It has focused attention on Direct Solution Methods It has started a movement towards end–to–end learning Traditional Approach: New Approach: push learning into earlier stages collapse last two stages into one using model classes that are richer (e.g. higher dimensions) more flexible (e.g. accommodates different data types) Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
  • 52. Impacts of on DDM   It has focused attention on Direct Solution Methods It has started a movement towards end–to–end learning Traditional Approach: New Approach: push learning into earlier stages collapse last two stages into one using model classes that are richer (e.g. higher dimensions) more flexible (e.g. accommodates different data types) Myth: data must be mapped to before we can build a S R model. example Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
  • 53. Impacts of on DDM   It has focused attention on Direct Solution Methods It has started a movement towards end–to–end learning Traditional Approach: New Approach: push learning into earlier stages collapse last two stages into one using model classes that are richer (e.g. higher dimensions) more flexible (e.g. accommodates different data types) simply parameterized Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
  • 54. Impacts of on DDM   It has focused attention on Direct Solution Methods It has started a movement towards end–to–end learning Traditional Approach: New Approach: push learning into earlier stages collapse last two stages into one using model classes that are richer (e.g. higher dimensions) more flexible (e.g. accommodates different data types) simply parameterized Example: Kernel Machines Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
  • 55. Impacts of on DDM   It has focused attention on Direct Solution Methods It has started a movement towards end–to–end learning Traditional Approach: New Approach: push learning into earlier stages collapse last two stages into one using model classes that are richer (e.g. higher dimensions) more flexible (e.g. accommodates different data types) simply parameterized Paradigm Shift?: replace feature design with kernel design Los Alamos National Laboratory LAUR Number 06-2338 – p.17/30
  • 56. Applying to Anomaly   Detection Los Alamos National Laboratory LAUR Number 06-2338 – p.18/30
  • 57. + Anomaly Detection   Los Alamos National Laboratory LAUR Number 06-2338 – p.19/30
  • 58. + Anomaly Detection   Definition of Anomaly: is anomalous if its density values T falls below a threshold, i.e. . V X W U T Y Los Alamos National Laboratory LAUR Number 06-2338 – p.19/30
  • 59. + Anomaly Detection   Definition of Anomaly: is anomalous if its density values T falls below a threshold, i.e. . V X W U T Y p ρ p ρ x anomalous normal anomalous Los Alamos National Laboratory LAUR Number 06-2338 – p.19/30
  • 60. + Anomaly Detection   Definition of Anomaly: is anomalous if its density values T falls below a threshold, i.e. . V X W U T Y p f ρ p ρ x f0 ∆ A detector predicts an anomaly when . V bW ` ` c a T Los Alamos National Laboratory LAUR Number 06-2338 – p.19/30
  • 61. + Anomaly Detection   Definition of Anomaly: is anomalous if its density values T falls below a threshold, i.e. . V X W U T Y p f ρ p ρ x f0 ∆ A detector predicts an anomaly when . V bW ` ` c a T Criterion Function: the labeling error rate, V efW hV W ` g d Los Alamos National Laboratory LAUR Number 06-2338 – p.19/30
  • 62. + Anomaly Detection   Information: First Principles: the operating environment is characterized by a stationary random process Empirical: (unlabeled) training data is sampled from that process Los Alamos National Laboratory LAUR Number 06-2338 – p.20/30
  • 63. + Anomaly Detection   Information: First Principles: the operating environment is characterized by a stationary random process Empirical: (unlabeled) training data is sampled from that process Validation: Empirical: no reliable method for estimating ! V W ` d Los Alamos National Laboratory LAUR Number 06-2338 – p.20/30
  • 64. + Anomaly Detection   Information: First Principles: the operating environment is characterized by a stationary random process Empirical: (unlabeled) training data is sampled from that process Validation: Empirical: no reliable method for estimating ! V W ` d Theoretical: Substantial work on the accuracy of density estimation methods, but little work on their accuracy with respect to ! V W ` d Los Alamos National Laboratory LAUR Number 06-2338 – p.20/30
  • 65. + AD = Recent Discovery   LANL discovered a function that, with a mild assumption di on the distribution, is Los Alamos National Laboratory LAUR Number 06-2338 – p.21/30
  • 66. + AD = Recent Discovery   LANL discovered a function that, with a mild assumption di on the distribution, is calibrated with respect to , and d Los Alamos National Laboratory LAUR Number 06-2338 – p.21/30
  • 67. + AD = Recent Discovery   LANL discovered a function that, with a mild assumption di on the distribution, is calibrated with respect to , and d can be reliably estimated from sample data Los Alamos National Laboratory LAUR Number 06-2338 – p.21/30
  • 68. + AD = Recent Discovery   LANL discovered a function that, with a mild assumption di on the distribution, is calibrated with respect to , and d can be reliably estimated from sample data Consequences: Los Alamos National Laboratory LAUR Number 06-2338 – p.21/30
  • 69. + AD = Recent Discovery   LANL discovered a function that, with a mild assumption di on the distribution, is calibrated with respect to , and d can be reliably estimated from sample data Consequences: empirical validation is now possible! Los Alamos National Laboratory LAUR Number 06-2338 – p.21/30
  • 70. + AD = Recent Discovery   LANL discovered a function that, with a mild assumption di on the distribution, is calibrated with respect to , and d can be reliably estimated from sample data Consequences: empirical validation is now possible! direct solution methods can now be developed for AD Los Alamos National Laboratory LAUR Number 06-2338 – p.21/30
  • 71. + AD = Recent Discovery   LANL discovered a function that, with a mild assumption di on the distribution, is calibrated with respect to , and d can be reliably estimated from sample data Consequences: empirical validation is now possible! direct solution methods can now be developed for AD LANL has developed a direct solution method with properties similar to SVMs for supervised classification Los Alamos National Laboratory LAUR Number 06-2338 – p.21/30
  • 72. Philosophy   Los Alamos National Laboratory LAUR Number 06-2338 – p.22/30
  • 73. Philosophy   Without a performance criterion model design is trivial. Los Alamos National Laboratory LAUR Number 06-2338 – p.23/30
  • 74. Philosophy   Without a performance criterion model design is trivial. Validation is the cornerstone of the scientific method. Los Alamos National Laboratory LAUR Number 06-2338 – p.23/30
  • 75. Philosophy   Without a performance criterion model design is trivial. Validation is the cornerstone of the scientific method. Empirical and theoretical validation have different strengths and weaknesses, and having both provides a complete picture. Los Alamos National Laboratory LAUR Number 06-2338 – p.23/30
  • 76. Philosophy   Without a performance criterion model design is trivial. Validation is the cornerstone of the scientific method. Empirical and theoretical validation have different strengths and weaknesses, and having both provides a complete picture. FP and EMP information are both critical for success. Myth: All you need is data. Los Alamos National Laboratory LAUR Number 06-2338 – p.23/30
  • 77. Philosophy   Why there will never be a Nobel Prize in q p Los Alamos National Laboratory LAUR Number 06-2338 – p.24/30
  • 78. Philosophy   Why there will never be a Nobel Prize in q p Performance matters, but a first principles interpretation of the model does not. Los Alamos National Laboratory LAUR Number 06-2338 – p.24/30
  • 79. Philosophy   Why there will never be a Nobel Prize in q p Performance matters, but a first principles interpretation of the model does not. Myth: A FP model is necessary to achieve good performance. example. Los Alamos National Laboratory LAUR Number 06-2338 – p.24/30
  • 80. Challenges of Modern Data Biggest challenge is not large amounts of data, but rather the lack of relevant information Los Alamos National Laboratory LAUR Number 06-2338 – p.25/30
  • 81. Challenges of Modern Data Biggest challenge is not large amounts of data, but rather the lack of relevant information large amount of data large amount of information s br Los Alamos National Laboratory LAUR Number 06-2338 – p.25/30
  • 82. Challenges of Modern Data Biggest challenge is not large amounts of data, but rather the lack of relevant information large amount of data large amount of information s br the mis–match between the natural structure of the data and the objects of interest Los Alamos National Laboratory LAUR Number 06-2338 – p.25/30
  • 83. Challenges of Modern Data Biggest challenge is not large amounts of data, but rather the lack of relevant information large amount of data large amount of information s br the mis–match between the natural structure of the data and the objects of interest the (increasing) gap between existing tools and the problems we want to solve Los Alamos National Laboratory LAUR Number 06-2338 – p.25/30
  • 84. Challenges of Modern Data Biggest challenge is not large amounts of data, but rather the lack of relevant information large amount of data large amount of information s br the mis–match between the natural structure of the data and the objects of interest the (increasing) gap between existing tools and the problems we want to solve and last but not least ... Los Alamos National Laboratory LAUR Number 06-2338 – p.25/30
  • 85. Challenges of Modern Data The lack of a scientific problem formulations ! Los Alamos National Laboratory LAUR Number 06-2338 – p.26/30
  • 86. Challenges of Modern Data The lack of a scientific problem formulations ! How well does Google work? What is a meaningful performance criterion? How can it be validated? Los Alamos National Laboratory LAUR Number 06-2338 – p.26/30
  • 87. THE END Thank You! Los Alamos National Laboratory LAUR Number 06-2338 – p.27/30
  • 88. SVMs and Curse of Dimensionality DARPA Intrusion Detection Data Dimension Error Rate (%) 27 0.47 0.18 x vt wH u 0.14 € uy wH return Los Alamos National Laboratory LAUR Number 06-2338 – p.28/30
  • 89. Anomaly Detection on Graphs Individual graphs represent the interaction between people in a text unit (e.g book, magazine, newspaper, report, or sections of these types of documents). People (vertices) are labeled by their rank (1 = most important). Normal Graph Anomalous Graph return Los Alamos National Laboratory LAUR Number 06-2338 – p.29/30
  • 90. Gaussian Benchmark Problem The data is Gaussian The Gaussian Maximum Likelihood (GML) method uses a first principles model and the SVM uses a universal model. return Los Alamos National Laboratory LAUR Number 06-2338 – p.30/30