Dan Steinberg
  N Scott Cardell
Mykhaylo Golovnya
 November, 2011
 Salford Systems
Data Mining
• Predictive Analytics, Machine Learning, Pattern Recognition, Artificial Intelligence, Business Intelligence, Data Warehousing
• Statistics, Computer Science, Database Management, Insurance, Finance, Marketing, Electrical Engineering, Robotics, Biotech and more
• OLAP, CART, SVM, NN, CRISP-DM, CRM, KDD, etc.
   Data mining is the search for patterns in data
    using modern highly automated, computer
    intensive methods

    ◦ Data mining may be best defined as the use of a specific
      class of tools (data mining methods) in the analysis of
      data

    ◦ The term search is key to this definition, as is
      “automated”

   The literature often refers to finding hidden
    information in data
• Study the phenomenon
              • Understand its nature
  Science     • Try to discover a law
              • The laws usually hold for a long time

              •Collect some data
              •Guess the model (perhaps, using science)
 Statistics   •Use the data to clarify and/or validate the model
              •If it looks “fishy”, pick another model and try again

              • Access to lots of data
              • No clue what the model might be
Data Mining   • No long term law is even possible
              • Let the machine build a model
              • And let's use this model while we can
   Quest for the Holy Grail- build an algorithm that will
    always find 100% accurate models

   Absolute Powers- data mining will finally find and
    explain everything

   Gold Rush- with the right tool one can beat the stock market and become obscenely rich

   Magic Wand- getting a complete solution from start
    to finish with a single button push

   Doomsday Scenario- all conventional analysts will
    eventually be replaced by smart computer chips
   This is known as “supervised learning”

    ◦ We will focus on patterns that allow us to accomplish two tasks
       Classification
       Regression

   This is known as “unsupervised learning”

    ◦ We will briefly touch on a third common task
       Finding groups in data (clustering, density estimation)

   There are other patterns we will not discuss today
    including

    ◦ Patterns in sequences
    ◦ Connections in networks (the web, social networks, link analysis)
   CART® (Decision Trees, C4.5, CHAID among others)

   MARS® (Multivariate Adaptive Regression Splines)

   Artificial Neural Networks (ANNs, many commercial)

   Association Rules (Clustering, market basket analysis)

   TreeNet® (Stochastic Gradient Tree Boosting)

   RandomForests® (Ensembles of trees w/ random splits)

   Genetic Algorithms (evolutionary model development)

   Self Organizing Maps (SOM, like k-means clustering)

   Support Vector Machine (SVM wrapped in many patents)

   Nearest Neighbor Classifiers
   (Insert chart)

   In a nutshell: Use historical data to gain
    insights and/or make predictions on the new
    data
   Given enough learning iterations, most data mining methods are
    capable of explaining everything they see in the input
    data, including noise

   Thus one cannot rely on conventional (whole sample) statistical
    measures of model quality

   A common technique is to partition historical data into several
    mutually exclusive parts

    ◦ LEARN set is used to build a sequence of models varying in size and level
      of explained details

    ◦ TEST set is used to evaluate each candidate model and suggest the
      optimal one

    ◦ VALIDATE set is sometimes used to independently confirm the optimal
      model performance on yet another sample
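   A minimal sketch of this three-way partition (a hedged illustration, assuming a pandas DataFrame of historical records; the 60/20/20 fractions and the seed are illustrative only):

    import numpy as np
    import pandas as pd

    def partition(df: pd.DataFrame, learn=0.6, test=0.2, seed=17):
        """Randomly split historical data into mutually exclusive LEARN / TEST / VALIDATE sets."""
        idx = np.random.default_rng(seed).permutation(len(df))
        n_learn, n_test = int(learn * len(df)), int(test * len(df))
        return (df.iloc[idx[:n_learn]],                  # LEARN: build the sequence of candidate models
                df.iloc[idx[n_learn:n_learn + n_test]],  # TEST: evaluate candidates, suggest the optimal one
                df.iloc[idx[n_learn + n_test:]])         # VALIDATE: independently confirm performance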
Historical Data     Build a
                  Sequence of
    Learn           Models


                    Monitor
     Test
                  Performance



                   Confirm
   Validate
                   Findings
   Analyst needs to indicate where the TEST data is to be
    found
    ◦ Stored in a separate file

    ◦ Selected at random from the available data

    ◦ Pre-selected from available data and marked by a special indicator

   Other things to consider
    ◦ Population: LEARN and TEST sets come from different populations
      (within-sample versus out-of-sample)

    ◦ Time: LEARN and TEST sets come from different time periods
      (within-time versus out-of-time)

    ◦ Aggregation: logically grouped records must be all included or all
      excluded within each set (self-correlation)
   Any model is built on past data!

   Fortunately, many models trace stable patterns of behavior

   However, any model will eventually have to be rebuilt:
    ◦ Banks like to refresh risk models about every 12 months

    ◦ Targeted marketing models are typically refreshed every 3 months

    ◦ Ad web-server models may be refreshed every 24 hours

   Credit risk scorecard expert Professor David Hand, University of London, maintains:
    ◦ A predictive model is obsolete the day it is first deployed
   Model evaluation is at the core of the learning process (choosing
    the optimal model from a list of candidates)

   Model evaluation is also a key part in comparing performance of
    different algorithms

   Finally, model evaluation is needed to continuously monitor
    model performance over time

   In predictive modeling (classification and regression) all we need
    is a sample of data with known outcome; different evaluation
    criteria can then be applied

   There will never be “the best for all” model; the optimality is
    contingent upon current evaluation criterion and thus depends
    on the context in which the model is applied
   (insert graph)

   One usually computes some measure of average
    discrepancy between the continuous model predictions f
    and the actual outcome y
     ◦ Least Squares Deviation: R = Σ (y − f)²
     ◦ Least Absolute Deviation: R = Σ |y − f|

   Fancier definitions also exist
     ◦ Huber-M Loss: a hybrid between the LS and LAD losses
     ◦ SVM Loss: ignores very small discrepancies and then switches to LAD-style behavior

   The raw loss value is often re-expressed in relative terms
    as R-squared
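   A short sketch of these loss functions, assuming numpy arrays y (actual) and f (predicted); the Huber threshold delta is an illustrative parameter:

    import numpy as np

    def ls_loss(y, f):                 # Least Squares Deviation: R = sum (y - f)^2
        return np.sum((y - f) ** 2)

    def lad_loss(y, f):                # Least Absolute Deviation: R = sum |y - f|
        return np.sum(np.abs(y - f))

    def huber_loss(y, f, delta=1.0):   # Huber-M: quadratic for small residuals, LAD-like for large ones
        r = np.abs(y - f)
        return np.sum(np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta)))

    def r_squared(y, f):               # raw LS loss re-expressed in relative terms
        return 1.0 - ls_loss(y, f) / ls_loss(y, np.mean(y))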
   There are three progressively more demanding approaches to
    solving binary classification problems

   Division: a model makes the final class assignment for each
    observation internally
    ◦ Observations with identical class assignment are no longer discriminated
    ◦ A model needs to be rebuilt to change decision rules

   Rank: a model assigns a continuous score to each observation
    ◦ The score on its own bears no direct interpretation

    ◦ But, higher class score means higher likelihood of class presence in
      general (without precise quantitative statements)

    ◦ Any monotone transformation of scores is admissible

    ◦ A spectrum of decision rules can be constructed strictly based on varying
      score threshold without model rebuilding

   Probability: a model assigns a probability score to each observation
    ◦ Same as above, but the output is interpreted directly in the exact probabilistic
      terms
   Depending on the prediction emphasis, various performance evaluation
    criteria can be constructed for binary classification models

   The following list, far from being exhaustive, presents some of the frequently used evaluation criteria
    ◦ Accuracy (more generally- Expected Cost)
       Applicable to all models

    ◦ ROC Curve and Area Under Curve
       Not Applicable to Division Models

    ◦ Gains and Lift
       Not Applicable to Division Models

     ◦ Log-likelihood (a.k.a. Cross-Entropy, Deviance)
       Not Applicable to Division and Rank Models

   The criteria above are listed in order from the least specific to the most specific

   It is not guaranteed that all criteria will suggest the same model as the
    optimal from a list of candidate models
   Most intuitive and also the weakest evaluation method that can be
    applied to any classification model

   Each record must be assigned to a specific class

   One first constructs a Prediction Success Table- a 2 by 2 matrix showing
    how many true 0s and 1s (rows) were classified by the model correctly or
    incorrectly (columns)

   The classification accuracy is then the number of correct class
    assignments divided by the sample size

   More general approaches will also include user supplied prior
    probabilities and cost matrix to compute the Expected Cost

   The example below reports prediction success tables for two separate
    models along with the accuracy calculations

   The method is not sensitive enough to reflect the larger class imbalance in Model 1
   (insert table)
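   A minimal sketch of building the prediction success table and the accuracy from it, assuming 0/1 actual and predicted class labels:

    import numpy as np

    def prediction_success(actual, predicted):
        """2-by-2 prediction success table: rows = true 0/1, columns = predicted 0/1."""
        table = np.zeros((2, 2), dtype=int)
        for a, p in zip(actual, predicted):
            table[a, p] += 1
        accuracy = np.trace(table) / table.sum()   # correct class assignments / sample size
        return table, accuracy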
   The classification accuracy approach assumes that each record
    has already been classified which is not always convenient

    ◦ Those algorithms producing a continuous score (Rank or Probability) will
      require a user-specified threshold to make final class assignments

     ◦ Different thresholds will result in different class assignments and likely different classification accuracies

   The accuracy approach focuses on the separating boundary and
    ignores fine probability structure outside the boundary

   Ideally, need an evaluator working directly with the score itself
    and not dependent on any external considerations like costs and
    thresholds

   Also, for Rank models the evaluator needs to be invariant with
    respect to monotone transformation of the scores so that the
    “spirit” of such models is not violated
   The following approach will take full advantage of the set of
    continuous scores produced by Rank or Probability models

   Pick one of the two target classes as the class in focus

   Sort a database by predicted score in descending order

   Choose a set of different score values
    ◦ Could be ALL of the unique scores produced by the model
    ◦ More often a set of scores obtained by binning sorted records into equal
      size bins

   For any fixed value of the score we can now compute:
     ◦ Sensitivity (a.k.a. True Positive Rate): Percent of the class in focus with predicted scores above the threshold
     ◦ Specificity (a.k.a. True Negative Rate): Percent of the opposite class with predicted scores below the threshold

   We then display the results as a plot of [sensitivity] versus [1-specificity]

   The resulting curve is known as the ROC Curve
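   A sketch of the construction just described, assuming 0/1 labels with class 1 in focus, a continuous model score per record, and a chosen set of score thresholds:

    import numpy as np

    def roc_points(y, score, thresholds):
        """Return (1 - specificity, sensitivity) pairs, one per score threshold."""
        y, score = np.asarray(y), np.asarray(score)
        points = []
        for t in thresholds:
            sensitivity = np.mean(score[y == 1] > t)    # percent of the focus class above the threshold
            specificity = np.mean(score[y == 0] <= t)   # percent of the opposite class below the threshold
            points.append((1.0 - specificity, sensitivity))
        return points                                   # plotting these pairs traces the ROC curve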
   (insert graph)
   ROC Curves for three different rank models are shown

   No single model can be considered the absolute best in all cases

   The optimal model selection will rest with the user

   Average overall performance can be measured as Area Under
    ROC Curve (AUC)
    ◦ ROC Curve (up to orientation) and AUC are invariant with respect to the
      focus class selection
     ◦ The best attainable AUC is always 1.0
    ◦ AUC of a model with randomly assigned scores is 0.5

   AUC has a direct interpretation
     ◦ Suppose we repeatedly pick one observation at random from the focus class and another from the opposite class
     ◦ Then AUC is the fraction of trials in which the focus class observation has a greater predicted score than the opposite class observation
     ◦ AUC below 0.5 means that something is fundamentally wrong
   The following example justifies another slightly different approach to model
    evaluation
   Suppose we want to mail a certain offer to P fraction of the population

   Mailing to a randomly chosen sample will capture about P fraction of the
    responders (random sampling procedure)

   Now suppose that we have access to a response model which ranks each potential
    responder by a score

   Now if we sample the P fraction of the population targeting members with the
    highest predicted scores first (model guided sampling), we could now get T
    fraction of the responders which we expect to be higher than P

   The lift at the P-th percentile is defined as the ratio T/P

   Obviously, meaningful models will always produce lift greater than 1

   The process can be repeated for all possible percentiles and the results can be
    summarized graphically as Gains and Cumulative Lift curves

   In practice, one usually first sorts observations by scores and then partitions
    sorted data into a fixed number of bins to save on calculations just like it is
    usually done for ROC curves
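   A sketch of the calculation: sort by score, target the top P fraction, and compare the captured responder fraction T with P; ten equal-size bins are used purely for illustration:

    import numpy as np

    def gains_and_lift(y, score, n_bins=10):
        """Gains (T) and cumulative lift (T/P) at each score-sorted population cutoff."""
        y = np.asarray(y)[np.argsort(-np.asarray(score))]    # responder flags sorted by descending score
        total = y.sum()
        rows = []
        for b in range(1, n_bins + 1):
            p = b / n_bins                         # fraction of the population targeted
            t = y[:int(p * len(y))].sum() / total  # fraction of responders captured (gains)
            rows.append((p, t, t / p))             # (P, gains, lift); meaningful models give lift > 1
        return rows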
   (insert graphs and tables)
   (insert graphs)
   Lift in the given percentile provides a point measure of
    performance for the given population cutoff
    ◦ Can be viewed as the relative length of the vertical line segment
      connecting the gains curve at the given population cutoff

   Area Under the Gains curve (AUG): Provides an integral measure
    of performance across all bins
     ◦ Unlike AUC, the largest attainable value of AUG is (1 − P/2), P being the fraction of responders in the population

   Just like ROC-curves, gains and lift curves for different models
    can intersect, so that performance-wise one model is better for
    one range of cutoffs while another model is better for a different
    range

   Unlike ROC-curve, gains and lift curves do depend on the class
    in focus
     ◦ For the dominant class, gains and lift curves degenerate to the trivial 45-degree line (the random case)
   ROC, Gains, and lift curves together with AUC and AUG are invariant
    with respect to monotone transformation of the model scores
    ◦ Scores are only used to sort records in the evaluation set, the actual score
      values are of no consequence

   All these measures address the same conceptual phenomenon
    emphasizing different sides and thus can be easily derived from
    each other
    ◦ Any point (P,G) on a gains curve corresponds to the point (P,G/P) on the
      lift curve
    ◦ Suppose that the focus class occupies fraction F of the population; then
      any point (P,G) on a gains curve corresponds to the point {(P-FG)/(1-F),G}
      on the ROC curve

       It follows that the ROC graph “pushes” the gains graph “away” from the 45
        degree line
       Dominant focus class (large F) is “pushed” harder so that the degeneracy of
        its gain curve disappears
       In contrast, rare focus class (small F) has ROC curve naturally “close” to the
        gains curve

   All of these measures are widely used as robust performance
    evaluations in various practical applications
   When the output score can be interpreted as a probability, a more specific evaluation criterion can be constructed to assess the probabilistic accuracy of the model

   We assume that the model generates P(X)-the conditional probability of
    1 given X

   We also assume that the binary target Y is coded as -1 and +1 (only for
    notational convenience)

   The Cross-Entropy (CXE) criterion is then computed as (insert equation)
     ◦ The inner Log computes the log-odds of Y=1
     ◦ The value itself is the negative log-likelihood assuming independence of responses
     ◦ Alternative notation assumes 0/1 target coding and uses the following formula (insert equation)
     ◦ The values produced by either formula are identical (see the reconstruction below)

   The model with the smallest CXE has the largest likelihood and is thus considered the best at capturing the right probability structure
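   A hedged reconstruction of the omitted equations, using standard definitions consistent with the text above: with Y coded -1/+1 and F(X) the model's log-odds of Y=1,

    CXE = \sum_i \log\bigl(1 + e^{-y_i F(x_i)}\bigr), \qquad F(x) = \log\frac{P(x)}{1 - P(x)}

   and with the alternative 0/1 target coding, using P(X) directly,

    CXE = -\sum_i \bigl[\, y_i \log P(x_i) + (1 - y_i)\,\log\bigl(1 - P(x_i)\bigr) \bigr]

   Both forms yield identical values.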
   The example shows true non-monotonic conditional probability
    (dark blue curve)

   We generated 5,000 LEARN and TEST observations based on this
    probability model

   We report predicted responses generated by different modeling
    approaches
    ◦ Red- best accuracy MART model

    ◦ Yellow- best CXE MART model

    ◦ Cyan- univariate LOGIT model

   Performance-wise
    ◦ All models have identical accuracy but the best accuracy model is
      substantially worse in terms of CXE

     ◦ LOGIT can't capture departure from monotonicity as reported by CXE
   MARS is a highly-automated tool for regression

   Developed by Jerome H. Friedman of Stanford University
     ◦ Annals of Statistics, 1991; a dense 65-page article
    ◦ Takes some inspiration from its ancestor CART®
    ◦ Produces smooth curves and surfaces, not the step-functions of CART

   Appropriate target variables are continuous

   End result of a MARS run is a regression model
    ◦   MARS automatically chooses which variables to use
    ◦   Variables are optimally transformed
    ◦   Interactions are detected
    ◦   Model is self-tested to protect against over-fitting

   Can also perform well on binary dependent variables
    ◦ Censored survival model (waiting time models as in churn)
   Harrison, D. and D. Rubinfeld.
    Hedonic Housing Prices and Demand for Clean Air. Journal of
    Environmental Economics and Management v5, 81-102, 1978

   506 census tracts in city of Boston for the year 1970

   Goal: study relationship between quality of life variables and property
    values

     ◦   MV- median value of owner-occupied homes in tract ('000s)
    ◦   CRIM- per capita crime rates
    ◦   NOX- concentration of nitrogen oxides (pphm)
    ◦   AGE- percent built before 1940
    ◦   DIS- weighted distance to centers of employment
    ◦   RM- average number of rooms per house
     ◦   LSTAT- percent neighborhood 'lower socio-economic status'
    ◦   RAD- accessibility to radial highways
    ◦   CHAS- borders Charles River (0/1)
    ◦   INDUS- percent non-retail business
    ◦   TAX- tax rate
    ◦   PT- pupil teacher ratio
   (insert graph)
   The dataset poses significant challenges to
    conventional regression modeling

    ◦ Clearly departure from normality, non-linear
      relationships, and skewed distributions

    ◦ Multicollinearity, mutual dependency, and outlying
      observations
   (insert graph)

   A typical MARS solution (univariate for simplicity)
    is shown above

    ◦ Essentially a piece-wise linear regression model with the
      continuity requirement at the transition points called
      knots

    ◦ The locations and number of knots were determined
      automatically to ensure the best possible model fit

    ◦ The solution can be analytically expressed as
      conventional regression equations
   Finding the one best knot in a simple regression is a straightforward
    search problem
    ◦ Try a large number of potential knots and choose one with the best R-
      squared
     ◦ Computation can be implemented efficiently using update algorithms; the entire regression does not have to be rerun for every possible knot (just update the X'X matrices)

   Finding k knots simultaneously would require on the order of N^k computations for N observations

   To preserve linear problem complexity, multiple knot placement is implemented in a step-wise manner:
    ◦ Need a forward/backward procedure
    ◦ The forward procedure adds knots sequentially one at a time
       The resulting model will have many knots and overfit the training data
    ◦ The backward procedure removes least contributing knots one at a time
       This produces a list of models of varying complexity
    ◦ Using appropriate evaluation criterion, identify the optimal model

   Resulting model will have approximately correct knot locations
   (insert graphs)

   True conditional mean has two knots at X=30
    and X=60, observed data includes additional
    random error

   Best single knot will be at X=45, subsequent best
    locations are true knots around 30 and 60

   The backward elimination step is needed to remove the redundant knot at X=45
   Thinking in terms of knot selection works very well to illustrate splines in one dimension but becomes unwieldy when working with a large number of variables simultaneously
    ◦ Need a concise notation easy to program and extend in multiple
      dimensions

    ◦ Need to support interactions, categorical variables, and missing
      values

   Basis functions (BF) provide analytical machinery to
    express the knot placement strategy

   A basis function is a continuous univariate transform that restricts a predictor's influence to a smaller range of values, controlled by a parameter c (20 in the example below)
    ◦ Direct BF: max(X-c, 0)- the original range is cut below c

    ◦ Mirror BF: max (c-X, 0)- the original range is cut above c
    ◦ (insert graphs)
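   A brute-force sketch tying the basis functions to the single-knot search described two slides earlier, assuming numpy arrays x (one predictor) and y (the target); real MARS uses fast update formulae rather than refitting from scratch as done here:

    import numpy as np

    def direct_bf(x, c):
        return np.maximum(x - c, 0.0)   # original range is cut below c

    def mirror_bf(x, c):
        return np.maximum(c - x, 0.0)   # original range is cut above c

    def best_single_knot(x, y, candidates):
        """Try each candidate knot c, regress y on [1, direct_bf, mirror_bf], keep the best R-squared."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        sst = np.sum((y - y.mean()) ** 2)
        best_c, best_r2 = None, -np.inf
        for c in candidates:
            X = np.column_stack([np.ones_like(y), direct_bf(x, c), mirror_bf(x, c)])
            coef, *_ = np.linalg.lstsq(X, y, rcond=None)
            r2 = 1.0 - np.sum((y - X @ coef) ** 2) / sst
            if r2 > best_r2:
                best_c, best_r2 = c, r2
        return best_c, best_r2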
   The following model represents a 3-knot
    univariate solution for the Boston Housing
    Dataset using two direct and one mirror basis
    functions

   (insert equations)

   All three line segments have negative slope
    even though two coefficients are above zero

   (insert graph)
   MARS core technology:
    ◦ Forward step: add basis function pairs one at a time in conventional step-
      wise forward manner until the largest model size (specified by the user) is
      reached
       Possible collinearity due to redundancy in pairs must be detected and
        eliminated
       For categorical predictors define basis functions as indicator variables for all
        possible subsets of levels
       To support interactions, allow cross products between a new candidate pair
        and basis functions already present in the model

    ◦ Backward step: remove basis functions one at a time in conventional step-
      wise backward manner to obtain a sequence of candidate models
    ◦ Use test sample or cross-validation to identify the optimal model size

   Missing values are treated by constructing missing value
    indicator (MVI) variables and nesting the basis functions within
    the corresponding MVIs

   Fast update formulae and smart computational shortcuts exist to
    make the MARS process as fast and efficient as possible
   OLS and MARS regression (insert graphs)

   We compare the results of classical linear regression
    and MARS
    ◦ Top three significant predictors are shown for each model

    ◦ Linear regression provides global insights

    ◦ MARS regression provides local insights and has superior
      accuracy

      All cut points were automatically discovered by MARS

      MARS model can be presented as a linear regression model in
       the BF space
   One of the oldest Data Mining tools for classification
   The method was originally developed by Fix and Hodges (1951) in an
    unpublished technical report

   Later on it was reproduced by Agrawala (1977), Silverman and Jones
    (1989)
   A review book with many references on the topic is Dasarathy (1991)

   Other books that treat the issue:
    ◦ Ripley B.D. 1996. Pattern Recognition and Neural Networks (chapter 6)
    ◦ Hastie T, Tibshirani R and Friedman J. 2001. The Elements of Statistical Learning Data
      Mining, Inference and Prediction (chapter 13)

   The underlying idea is quite simple: make the predictions by proximity
    or similarity

   Example: we are interested in predicting if a customer will respond to an
    offer. A NN classifier will do the following:
     ◦ Identify a set of people most similar to the customer- the nearest neighbors
    ◦ Observe what they have done in the past on a similar offer
    ◦ Classify by majority voting: if most of them are responders, predict a
      responder, otherwise, predict a non-responder
   (insert graphs)
   Consider binary classification problem

   Want to classify the new case highlighted in yellow

   The circle contains the nearest neighbors (the most similar
    cases)
    ◦ Number of neighbors= 16
    ◦ Votes for blue class= 13
    ◦ Votes for red class= 3

   Classify the new case in the blue class. The estimated probability
    of belonging to the blue class is 13/16=0.8125

   Similarly in this example:
    ◦ Classify the yellow instance in the blue class
    ◦ Classify the green instance in the red class
    ◦ The black point receives three votes from the blue class and another three
      from the red one- the resulting classification is indeterminate
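   A minimal sketch of the voting scheme just illustrated, assuming a numeric feature matrix X, 0/1 class labels y, Euclidean distance, and an illustrative K of 16:

    import numpy as np

    def knn_classify(X, y, query, k=16):
        """Classify a new case by majority vote among its k nearest neighbors."""
        X, y = np.asarray(X, float), np.asarray(y)
        dist = np.sqrt(((X - query) ** 2).sum(axis=1))   # Euclidean distance to every stored case
        votes = y[np.argsort(dist)[:k]].sum()            # votes for class 1 among the k closest cases
        prob = votes / k                                 # estimated class-1 probability (e.g. 13/16 = 0.8125)
        return int(prob > 0.5), prob                     # prob == 0.5 is the indeterminate (tie) case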
   There are two decisions that should be made in advance before
    applying the NN classifier
    ◦ The shape of the neighborhood
         Answers the question “Who are our nearest neighbors?”
    ◦ The number of neighbors (neighborhood size)
         Answers the question “How many neighbors do we want to consider?”

   Neighborhood shape amounts to choosing the
    proximity/distance measure
    ◦   Manhattan distance
    ◦   Euclidean distance
    ◦   Infinity distance
    ◦   Adaptive distances

   Neighborhood size K can vary between 1 and N (the dataset size)
     ◦ K=1: classification is based on the closest case in the dataset
     ◦ K=N: classification always goes to the majority class
    ◦ Thus K acts as a smoothing parameter and can be determined by using a
      test sample or cross-validation
   NN advantages
    ◦ Simple to understand and easy to implement
    ◦ The underlying idea is appealing and makes logical sense
    ◦ Available for both classification and regression problems
       Predictions determined by averaging the values of nearest neighbors
    ◦ Can produce surprisingly accurate results in a number of
      applications
        NN has been shown to perform as well as or better than LDA, CART, Neural Networks and other approaches when applied to remotely sensed data

   NN disadvantages
    ◦ Unlike decision trees, LDA, or logistic regression, their decision
      boundaries are not easy to describe and interpret
    ◦ No variable selection of any kind- vulnerable to noisy inputs
       All the variables have the same weight when computing the distance, so
        two cases could be considered similar (or dissimilar) due to the role of
        irrelevant features (masking effects)
    ◦ Subject to the curse of dimensionality in high dimension datasets
     ◦ The technique is quite time consuming; however, Friedman et al. (1975 and 1977) have proposed fast algorithms
   Classification and Regression Trees (CART®)- original approach
    based on the “let the data decide local regions” concept
    developed by Breiman, Friedman, Olshen, and Stone in 1984

   The algorithm can be summarized as:
    ◦ For each current data region, consider all possible orthogonal splits (based
      on one variable) into 2 sub-regions
    ◦ The best split is defined as the one having the smallest MSE after fitting a
      constant in each sub-region (regression) or the smallest resulting class
      impurity (classification)
    ◦ Proceed recursively until all structure in the training set has been
      completely exhausted- largest tree is produced
    ◦ Create a sequence of nested sub-trees with different amount of
      localization (tree pruning)
     ◦ Pick the best tree based on performance on a test set or via cross-validation

   One can view CART tree as a set of dynamically constructed
    orthogonal nearest neighbor boxes of varying sizes guided by
    the response variable (homogeneity of response within each box)
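   A sketch of the exhaustive orthogonal split search at a single node, assuming numeric predictors and a binary target scored with Gini impurity; the real algorithm adds priors, costs, categorical splits, and surrogates:

    import numpy as np

    def gini(y):                                      # node impurity for a 0/1 target
        p = y.mean() if len(y) else 0.0
        return 2.0 * p * (1.0 - p)

    def best_split(X, y):
        """Consider every (variable, threshold) pair; return the split with the lowest weighted impurity."""
        X, y = np.asarray(X, float), np.asarray(y, float)
        best = (None, None, np.inf)
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j])[:-1]:         # splitting at the maximum would leave one side empty
                left, right = y[X[:, j] <= t], y[X[:, j] > t]
                score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
                if score < best[2]:
                    best = (j, t, score)
        return best                                   # (variable index, threshold, weighted impurity)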
   CART is best illustrated with a famous example- the UCSD Heart
    Disease study
    ◦ Given the diagnosis of a heart attack based on
       Chest pain, Indicative EKGs, Elevation of enzymes typically released by
        damaged heart muscle, etc.

    ◦ Predict who is at risk of a 2nd heart attack and early death within 30 days

    ◦ Prediction will determine treatment program (intensive care or not)

   For each patient about 100 variables were available, including:
    ◦ Demographics, medical history, lab results

    ◦ 19 noninvasive variables were used in the analysis
       Age, gender, blood pressure, heart rate, etc.

   CART discovered a very useful model utilizing only 3 final
    variables
   (insert classification tree)
   Example of a CLASSIFICATION tree

   Dependent variable is categorical (SURVIVE, DIE)

   The model structure is inherently hierarchical and cannot be represented
    by an equivalent logistic regression equation

   Each terminal node describes a segment in the population

   All internal splits are binary

   Rules can be extracted to describe each terminal node

   Terminal node class assignment is determined by the distribution of the
    target in the node itself

   The tree effectively compresses the decision logic
   CART advantages:
    ◦ One of the fastest data mining algorithms available
    ◦ Requires minimal supervision and produces easy to understand
      models
    ◦ Focuses on finding interactions and signal discontinuities
    ◦ Important variables are automatically identified
    ◦ Handles missing values via surrogate splits
       A surrogate split is an alternative decision rule supporting the main rule
        by exploiting local rank-correlation in a node
    ◦ Invariant to monotone transformations of predictors

   CART disadvantages:
    ◦ Model structure is fundamentally different from conventional
      modeling paradigms- may confuse reviewers and classical
      modelers
    ◦ Has limited number of positions to accommodate available
      predictors- ineffective at presenting global linear structure (but
      great for interactions)
    ◦ Produces coarse-grained piece-wise constant response surfaces
   (insert charts)
   10-node CART tree was built on the cell phone dataset
    introduced earlier

   The root Node 1 displays details of TARGET variable in the
    training data
    ◦ 15.2% of the 830 households accepted the marketing offer

   CART tried all available predictors one at a time and found that partitioning the set of subjects based on the Handset Price variable is most effective at separating responders from non-responders at this point
     ◦ Those offered a phone priced above 130 contain only 9.9% responders
     ◦ Those offered a lower price (below 130) respond at 21.9%

   The process of splitting continues recursively until the largest
    tree is grown

   Subsequent tree pruning eliminates least important branches
    and creates a sequence of nested trees- candidate models
   (insert charts)
   The red nodes indicate good responders while the blue nodes
    indicate poor responders

   Observations with high values on a split variable always go right
    while those with low values go left

   Terminal nodes are numbered left to right and provide the
    following useful insights
    ◦ Node 1: young prospects having very small phone bill, living in specific
      cities are likely to respond to an offer with a cheap handset

    ◦ Node 5: mature prospects having small phone bill, living in specific cities
      (opposite Node1) are likely to respond to an offer with a cheap handset

    ◦ Nodes 6 and 8: prospects with large phone bill are likely to respond as
      long as the handset is cheap

     ◦ Node 10: “high-tech” prospects (having a pager) with a large phone bill are likely to respond even to offers with an expensive handset
   (insert graph, table and chart)

   A number of variables were identified as
    important
    ◦ Note the presence of surrogates not seen on the main
      tree diagram previously

   Prediction Success table reports classification
    accuracy on the test sample

   Top decile (10% of the population with the
    highest scores) captures 40% of the responders
    (lift of 4)
   (insert graphs)

   CART has a powerful mechanism of priors built
    into the core of the tree building mechanism

   Here we report the results of an experiment with
    prior on responders varying from 0.05 to 0.95 in
    increments of 0.05

   The resulting CART models “sweep” the modeling
    space enforcing different sensitivity-specificity
    tradeoff
   As prior on the given class decreases
   The class assignment threshold increases
   Node richness goes up
   But class accuracy goes down

   PRIORS EQUAL uses the root node class ratio as the class assignment
    threshold- hence, most favorable conditions to build a tree

   PRIORS DATA uses the majority rule as the class assignment threshold-
    hence, difficult modeling conditions on unbalanced classes.

   In reality, a proper combination of priors can be found experimentally

   Eventually, when priors are too extreme, CART will refuse to build a tree.
    ◦ Often the hottest spot is a single node in the tree built with the most
      extreme priors with which CART will still build a tree.

    ◦ Comparing hotspots in successive trees can be informative, particularly in
      moderately-sized data sets.
   (insert graph)

   We have a mixture of two overlapping classes

   The vertical lines show root node splits for
    different sets of priors. (the left child is classified
    as red, the right child is classified as blue)

   Varying priors provides effective control over the
    tradeoff between class purity and class accuracy
   Hot spots are areas of data very rich in the event of interest, even
    though they could only cover a small fraction of the targeted
    group
    ◦ A set of prospects rich in responders

    ◦ A set of transactions with abnormal amount of fraud

   The varying-priors collection of runs introduced above gives
    perfect raw material in the search of hot spots
    ◦ Simply look at all terminal nodes across all trees in the collection and
      identify the highest response segments

    ◦ Also want to have such segments as large as possible

    ◦ Once identified, the rules leading to such segments (nodes) are easily
      available

    ◦ (insert graph)
    ◦ The graph on the left reports all nodes according to their target coverage
      and lift
    ◦ The blue curve connects the nodes most likely to be a hot spot
   (insert graph)
   Our next experiment (variable shaving) runs as follows:
    ◦ Build a CART model with the full set of predictors

    ◦ Check the variable importance, remove the least important
      variable and rebuild CART model

    ◦ Repeat previous step until all variables have been removed

   Six-variable model has the best performance so far

   Alternative shaving techniques include:
    ◦ Proceed by removing the most important variable- useful in
      removal of model “hijackers”- variables looking very strong on the
      train data but failing on the test data (e.g. ID variables)

    ◦ Set up nested looping to remove redundant variables from the
      inner positions on the variable importance list
   (insert tree)
   Many predictive models benefit from Salford Systems
    patent on “Structured Trees”

   Trees constrained in how they are grown to reflect
    decision support requirements
    ◦ Variables allowed/disallowed depending on a level in a tree

    ◦ Variable allowed/disallowed depending on a node size

   In mobile phone example: want tree to first segment on
    customer characteristics and then complete using price
    variables
    ◦ Price variables are under the control of the company

    ◦ Customer characteristics are beyond company control
   Various areas of research were spawned by CART

   We report on some of the most interesting and well developed
    approaches

   Hybrid models
    ◦ Combining CART with linear and Logistic Regression
    ◦ Combining CART with Neural Nets

   Linear combination splits

   Committees of trees
    ◦ Bagging
    ◦ Arcing
    ◦ Random Forest

   Stochastic Gradient Boosting (MART a.k.a TreeNet)

   Rule Fit and Path Finder
   (insert images)
   Grow a tree on training data

   Find a way to grow another tree, different from currently
    available (change something in set up)

   Repeat many times, say 500 replications

   Average results or create a voting scheme
     ◦ For example, relate PD to the fraction of trees predicting default for a given case

   Beauty of the method is that every new tree starts with a
    complete set of data

   Any one tree can run out of data, but when that happens we just
    start again with a new tree and all the data (before sampling)
   Have a training set of size N

   Create a new data set of size N by doing sampling with
    replacement from the training set

   The new set (called bootstrap sample) will be different from the
    original:
    ◦   36.5% of the original records are excluded
    ◦   37.5% of the original records are included once
    ◦   18% of the original records are included twice
    ◦   6% of the original records are included three times
    ◦   2% of the original records are included four or more times

   May do this repeatedly to generate numerous bootstrap samples

   Example: distribution of record weights in one realized bootstrap
    sample
   (insert table)
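   A quick sketch of drawing one bootstrap sample and tabulating how often each original record appears; the exact proportions vary by sample and settle near the percentages quoted above as N grows:

    import numpy as np

    N = 100_000
    rng = np.random.default_rng(7)
    draws = rng.integers(0, N, size=N)            # sample N record indices with replacement
    counts = np.bincount(draws, minlength=N)      # how many times each original record was drawn
    for k in range(4):
        print(f"included {k} times: {np.mean(counts == k):.1%}")
    print(f"included 4+ times: {np.mean(counts >= 4):.1%}")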
   To generate predicted response, multiple trees are combined via
    voting (classification) or averaging (regression) schemas

   Classification trees “vote”
    ◦ Recall that classification trees classify
       Assign each case to ONE class only

     ◦ With 100 trees, 100 separate class assignments (votes) for each record

    ◦ Winner is the class with the most votes

    ◦ Fraction of votes can be used as a crude approximation to class
      probability

    ◦ Votes could be weighted- say by accuracy of individual trees or node sizes

    ◦ Class weights can be introduced to counter the effects of dominant classes

   Regression trees assign a real predicted value for each case
    ◦ Predictions are combined via averaging

    ◦ Results will be much smoother than from a single tree
   Breiman reports the results of running bootstrap
    aggregation (bagger) on four publicly available
    datasets from Statlog project

   In all cases the bagger shows substantial
    improvement in the classification accuracy

   It all comes at a price of no longer having a
    single interpretable model, substantially longer
    run time and greater demand on model storage
    space

   (insert tables)
   Bagging proceeds by independent, identically-distributed
    sampling draws

   Adaptive resampling: probability that a case is sampled varies
    dynamically
    ◦ Cases with higher current prediction errors have greater probability of
      being sampled in the next round
    ◦ Idea is to focus on these cases most difficult to predict correctly

   Similar procedure first introduced by Freund & Schapire (1996)

   Breiman variant (ARC-x4) is easier to understand:
     ◦ Suppose we have already grown K trees: let m = # of times case i was misclassified (0 ≤ m ≤ K) (insert equations)
     ◦ Weight = 1 for cases that were never misclassified (m = 0)
     ◦ Weight = 1 + K^4 for cases misclassified by all K trees
        The weight rapidly becomes large if a case is difficult to classify
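   A reconstruction of the omitted weighting rule, consistent with the bullets above: with m(i) the number of times case i has been misclassified by the K trees grown so far, the (unnormalized) resampling weight is

    w_i \;\propto\; 1 + m(i)^4, \qquad 0 \le m(i) \le K

   so a never-misclassified case keeps weight 1, while a case misclassified by all K trees receives weight 1 + K^4.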
   The results of running bagger and ARCer on the Boston
    Housing Data are reported below

   Bagger shows substantial improvement over the single-
    tree model

   ARCer shows marginal improvement over the bagger
   (insert table)

   Single tree now performs worse than stand alone CART run
    (R-squared=72%) because in bagging we always work with
    exploratory trees only

   Arcing performance beats MARS additive model but is still
    inferior to the MARS interactions model
   Boosting (and Bagging) are very slow and consume a lot of
    memory, the final models tend to be awkwardly large and
    unwieldy

   Boosting in general is vulnerable to overtraining
    ◦ Much better fit on training than on test data

    ◦ Tendency to perform poorly on future data

    ◦ Important to employ additional considerations to reduce overfitting

   Boosting is also highly vulnerable to errors in the data
    ◦ Technique designed to obsess over errors

    ◦ Will keep trying to “learn” patterns to predict miscoded data

    ◦ Ideally would like to be able to identify miscoded and outlying data and
      exclude those records from the learning process

    ◦ Documented in study by Dietterich (1998)
        An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization
   New approach for many data analytical tasks developed by
    Leo Breiman of University of California, Berkeley
    ◦ Co-author of CART® with Friedman, Olshen, and Stone

    ◦ Author of Bagging and Arcing approaches to combining trees

   Good for classification and regression problems
    ◦ Also for clustering, density estimation

    ◦ Outlier and anomaly detection

    ◦ Explicit missing value imputation

   Builds on the notions of committees of experts but is
    substantially different in key implementation details
   A random forest is a collection of single trees grown in a
    special way
    ◦ Each tree is grown on a bootstrap sample from the learning set
     ◦ A number R is specified (square root by default) such that it is noticeably smaller than the total number of available predictors
    ◦ During tree growing phase, at each node only R predictors are
      randomly selected and tried

   The overall prediction is determined by voting (in
    classification) or averaging (in regression)

   The Law of Large Numbers ensures convergence

   The key to accuracy is low correlation between trees and low bias

   To keep bias low, trees are grown to maximum depth
   Randomness is introduced in two distinct ways

   Each tree is grown on a bootstrap sample from the learning set
    ◦ Default bootstrap sample size equals original sample size
    ◦ Smaller bootstrap sample sizes are sometimes useful

   A number R is specified (square root by default) such that it is
    noticeably smaller than the total number of available predictors

   During tree growing phase, at each node only R predictors are
    randomly selected and tried.

   Randomness also reduces the signal to noise ratio in a single
    tree
    ◦ A low correlation between trees is more important than a high signal when
      many trees contribute to forming the model
    ◦ RandomForests™ trees often have very low signal strength, even when the
      signal strength of the forest is high.
   (insert graph)

   Gold- Average of 50 Base Learners

   Blue- Average of 100 Base Learners

   Red- Average of 500 Base Learners
   (insert graph)

   Averaging many base learners improves the
    signal to noise ratio dramatically provided
    that the correlation of errors is kept low

   Hundreds of base learners are needed for the
    most noticeable effect
   All major advantages of a single tree are automatically preserved

   Since each tree is grown on a bootstrap sample, one can
    ◦ Use out of bag samples to compute an unbiased estimate of the accuracy
    ◦ Use out of bag samples to determine variable importances

   There is no overfitting as the number of trees increases

   It is possible to compute generalized proximity between any pair
    of cases

   Based on proximities one can
    ◦   Proceed with a target-driven clustering solution
    ◦   Detect outliers
    ◦   Generate informative data views/projections using scaling coordinates
    ◦   Do missing value imputation

   Interesting approaches to expanding the methodology into
    survival models and the unsupervised learning domain
   RF introduces a novel way to define proximity between two observations:
     ◦ For a dataset of size N define an N×N matrix of proximities

    ◦ Initialize all proximities to zeroes

    ◦ For any given tree, apply the tree to the dataset

     ◦ If case i and case j both end up in the same node, increase the proximity prox(i,j) between i and j by one

    ◦ Accumulate over all trees in RF and normalize by twice the number of trees
      in RF

   The resulting matrix provides intrinsic measure of proximity
    ◦ Observations that are “alike” will have proximities close to one

    ◦ The closer the proximity to 0, the more dissimilar cases i and j are

    ◦ The measure is invariant to monotone transformations

    ◦ The measure is clearly defined for any type of independent
      variables, including categorical
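   A sketch of accumulating the proximity matrix, assuming each fitted tree can report the terminal-node id of every case (the helper leaf_ids is hypothetical); the normalization follows the description above:

    import numpy as np

    def rf_proximity(trees, X, leaf_ids):
        """prox[i, j] = normalized count of trees in which cases i and j share a terminal node."""
        N = len(X)
        prox = np.zeros((N, N))
        for tree in trees:
            leaves = leaf_ids(tree, X)                     # terminal-node id for each of the N cases
            prox += (leaves[:, None] == leaves[None, :])   # 1 where i and j land in the same node
        return prox / (2 * len(trees))                     # normalize by twice the number of trees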
   TreeNet (TN) is a new approach to machine learning and function
     approximation developed by Jerome H. Friedman at Stanford
    University
    ◦ Co-author of CART® with Breiman, Olshen and Stone
    ◦ Author of MARS®, PRIM, Projection Pursuit, COSA, RuleFit™ and more

   Also known as Stochastic Gradient Boosting and MART (Multiple
    Additive Regression Trees)

   Naturally supports the following classes of predictive models
    ◦ Regression (continuous target, LS and LAD loss functions)
    ◦ Binary Classification (binary target, logistic likelihood loss function)
    ◦ Multinomial classification (multiclass target, multinomial likelihood loss
      function)
    ◦ Poisson regression (counting target, Poisson Likelihood loss function)
    ◦ Exponential survival (positive target with censoring)
     ◦ Cox proportional hazards survival model

   TN builds on the notions of committees of experts and boosting
    but is substantially different in key implementation details
   We focus on TreeNet because:
   It is the method introduced in the original Stochastic Gradient
    Boosting article

   It is the method used in many successful real world studies

   We have found it to be more accurate than the other methods
    ◦ Many decisions that affect many people are made using a TreeNet model
    ◦ Major new fraud detection engine uses TreeNet
    ◦ David Cossock of Yahoo recently published a paper on uses of TreeNet in
      web search

   TreeNet is a fully developed methodology. New capabilities
    include:
    ◦   Graphical display of the impact of any predictor
    ◦   New automated ways to test for existence of interactions
    ◦   New ways to identify and rank interactions
    ◦   Ability to constrain model: allow some interactions and disallow others.
    ◦   Method to recast TreeNet model as a logistic regression.
   Built on CART trees and thus
    ◦ Immune to outliers

    ◦ Selects variables

    ◦ Results invariant with monotone transformations of variables

    ◦ Handles missing values automatically

   Resistant to mislabeled target data
    ◦ In medicine cases are commonly misdiagnosed

    ◦ In business, occasionally non-responders flagged as “responders”

   Resistant to overtraining- generalizes very well

   Can be remarkably accurate with little effort

   Trains very rapidly; comparable to CART
   2007 PAKDD competition: home loans up-sell to credit card owners
    2nd place
    ◦ Model built in half a day using previous year submission as a blueprint

   2006 PAKDD competition: customer type discrimination 3rd place
     ◦ Model built in one day. 1st place accuracy 81.9%; TreeNet accuracy 81.2%

   2005 BI-CUP Sponsored by University of Chile attracted 60 competitors

   2004 KDDCup “Most Accurate”

   2003 “Duke University/NCR Teradata CRN modeling competition
    ◦ Most Accurate and Best Top Decile Lift on both in and out of time samples

   A major financial services company has tested TreeNet across a
    broad range of targeted marketing and risk models for the past two
    years
     ◦ TreeNet consistently outperforms previous best models (by around 10% in AUROC)
    ◦ TreeNet models can be built in a fraction of the time previously devoted
    ◦ TreeNet reveals previously undetected predictive power in data
   Begin with one very small tree as initial model
    ◦ Could be as small as ONE split generating 2 terminal nodes

    ◦ Typical model will have 3-5 splits in a tree, generating 4-6 terminal nodes

    ◦ Output is a continuous response surface regardless of the target type
       Hence, Probability modeling type for classification

    ◦ Model is intentionally “weak”- shrink all model predictions towards zero
      by multiplying all predictions by a small positive learn rate

   Compute “residuals” for this simple model (prediction error) for
    every record in data
    ◦ The actual definition of the residual in this case is driven by the type of the
      loss function

   Grow second small tree to predict the residuals from first tree

   Continue adding more and more trees until a reasonable amount
    has been added
    ◦ It is important to monitor accuracy on an independent test sample
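   A stylized sketch of the loop above for a least-squares loss, where fit_small_tree(X, r) is a hypothetical helper that fits a 2-6 node regression tree; TreeNet's actual residual definitions, subsampling, and loss functions are richer:

    import numpy as np

    def boost(X, y, fit_small_tree, n_trees=500, learn_rate=0.01):
        """Additive model: each small tree is fit to the residuals of the current model."""
        prediction = np.zeros(len(y), dtype=float)
        trees = []
        for _ in range(n_trees):
            residuals = y - prediction                    # prediction error of the model so far
            tree = fit_small_tree(X, residuals)           # weak learner (intentionally small tree)
            prediction += learn_rate * tree.predict(X)    # shrink the update by the small learn rate
            trees.append(tree)                            # monitor test-sample accuracy to pick n_trees
        return trees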
   (insert chart)
   Trees are kept small (2-6 nodes common)

   Updates are small- can be as small as 0.01, 0.001, 0.0001

   Use random subsets of the training data in each cycle
    ◦ Never train on all the training data in any one cycle

   Highly problematic cases are IGNORED
    ◦ If model prediction starts to diverge substantially from observed data, that
      data will not be used in further updates

   TN allows very flexible control over interactions:
    ◦ Strictly Additive Models (no interactions allowed)

    ◦ Low level interactions allowed

    ◦ High level interactions allowed

    ◦ Constraints: only specific interactions allowed (TN PRO)
   As TN models consist of hundreds or even thousands of trees there is no
    useful way to represent the model via a display of one or two trees

   However, the model can be summarized in a variety of ways
    ◦ Partial Dependency Plots: These exhibit the relationship between the
      target and any predictor- as captured by the model
    ◦ Variable Importance Rankings: These stable rankings give an excellent
      assessment of the relative importance of predictors
    ◦ ROC and Gains Curves: TN Models produce scores that are typically unique
      for each scored record
    ◦ Confusion Matrix: Using an adjustable score threshold this matrix displays
      the model false positive and false negative rates

   TreeNet models based on 2-node trees by definition EXCLUDE interactions
    ◦ Model may be highly nonlinear but is by definition strictly additive
    ◦ Every term in the model is based on a single variable (single split)

   Build TreeNet on a larger tree (default is 6 nodes)
    ◦ Permits up to 5-way interaction but in practice is more like 3-way interaction
   Can conduct informal likelihood ratio test TN(2-node) versus TN(6-
    node)
   Large differences signal important interactions
   (insert graphs)

   The results of running TN on the Boston
    Housing Database are shown

   All of the key insights agree with previous
    findings by MARS and CART
   Slope reverses due to interaction

   Note that the dominant pattern is downward
    sloping, but that a key segment defined by
    the 3rd variable is upward sloping

   (insert graph)
   CART: Model is one optimized Tree
    ◦ Model is easy to interpret as rules
       Can be useful for data exploration, prior to attempting a more complex
         model
    ◦ Model can be applied quickly with a variety of workers:
       A series of questions for phone bank operators to detect fraudulent
         purchases
       Rapid triage in hospital emergency rooms
    ◦ In some cases may produce the best or the most predictive model, for example
      in classification with a barely detectable signal
    ◦ Missing values handled easily and naturally. Can be deployed effectively even
      when new data have a different missingness pattern
   Random Forests: combination of many LARGE trees
    ◦ Unique nonparametric distance metric that works in high dimensional spaces
    ◦ Often predicts well when other models work poorly, e.g. data with high level
      interactions
    ◦ In the most difficult data sets can be the best way to identify important
      variables
   Tree Net: combination of MANY small trees
    ◦ Best overall forecast performance in many cases
    ◦ Constrained models can be used to test the complexity of the data structure
      non-parametrically
    ◦ Exceptionally good with binary targets
   Neural Networks: combination of a few sigmoidal activation functions
    ◦ Very complex models can be represented in a very compact form

    ◦ Can accurately forecast both levels and slopes and even higher order
      derivatives

    ◦ Can efficiently use vector dependent variables
       Cross equation constraints can be imposed. (see Symmetry constraints for
        feedforward network models of gradient systems, Cardell, Joerding, and
        Li, IEEE Transactions on Neural Networks, 1993)

    ◦ During deployment phase, forecasts can be computed very quickly
       High voltage transmission lines use a neural network to detect whether there
        has been a lightning strike and are fast enough to shut down the line before
        it can be damaged

   Kernel function estimators: use a local mean or a local regression
    ◦ Local estimates easy to understand and interpret
    ◦ Local regression versions can estimate slopes and levels
    ◦ Initial estimation can be quick
   Random Forests:
    ◦ Models are large, complex and un-interpretable
    ◦ Limited to moderate sample sizes (usually less than 100,000
      observations)
    ◦ Hard to tell in advance which case Random Forests will work well
      on
    ◦ Deployed models require substantial computation

   Tree Net
    ◦ Models are large and complex, interpretation requires additional
      work
    ◦ Deployed models either require substantial computation or post-
      processing of the original model into a more compact form

   CART
    ◦ In most cases models are less accurate than TreeNet
    ◦ Works poorly in cases where effects are approximately linear in
      continuous variables or additive over many variables
   Neural Networks:
    ◦ Neural Networks cover such a wide variety of models that no good widely-
      applicable modeling software exists or may even be possible
        The most dramatic successes have been with Neural Network models that are idiosyncratic to the specific case, and were developed with great effort
       Fully optimized Neural Network parameter estimates can be very difficult to
        compute, and sometimes perform substantially worse than initial statistically
        inferior estimates. (this is called the “over training” issue)

    ◦ In almost all cases initial estimation is very compute intensive

    ◦ Limited to very small numbers of variables (typically between about 6 and
      20 depending on the application)

   Kernel Function Estimators:
    ◦ Deployed models can require substantial computation

    ◦ Limited to small numbers of variables

    ◦ Sensitive to distance measures. Even a modest number of variables can
      degrade performance substantially, due to the influence of relatively
      unimportant variables on the distance metric
   Breiman, L., J. Friedman, R. Olshen and C. Stone
    (1984), Classification and Regression Trees, Pacific Grove:
    Wadsworth
   Breiman, L. (1996). Bagging predictors. Machine
    Learning, 24, 123-140.
   Hastie, T., Tibshirani, R., and Friedman, J. H. (2001). The
     Elements of Statistical Learning. Springer.
   Freund, Y. & Schapire, R. E. (1996). Experiments with a
    new boosting algorithm. In L. Saitta, ed., Machine
     Learning: Proceedings of the Thirteenth International
    Conference, Morgan Kaufmann, pp. 148-156.
   Friedman, J.H. (1999). Stochastic gradient boosting.
    Stanford: Statistics Department, Stanford University.
   Friedman, J.H. (1999). Greedy function approximation: a
    gradient boosting machine. Stanford: Statistics
    Department, Stanford University.


Informs presentation new ppt

  • 10. Given enough learning iterations, most data mining methods are capable of explaining everything they see in the input data, including noise  Thus one cannot rely on conventional (whole sample) statistical measures of model quality  A common technique is to partition historical data into several mutually exclusive parts ◦ LEARN set is used to build a sequence of models varying in size and level of explained details ◦ TEST set is used to evaluate each candidate model and suggest the optimal one ◦ VALIDATE set is sometimes used to independently confirm the optimal model performance on yet another sample
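As a rough sketch of this partitioning idea (illustrative only: the 60/20/20 fractions, the function name, and the use of NumPy are assumptions, not part of the original deck):

    import numpy as np

    def partition(n_rows, learn_frac=0.6, test_frac=0.2, seed=17):
        """Randomly split row indices into LEARN / TEST / VALIDATE sets."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(n_rows)
        n_learn = int(learn_frac * n_rows)
        n_test = int(test_frac * n_rows)
        return (idx[:n_learn],                    # LEARN: build candidate models
                idx[n_learn:n_learn + n_test],    # TEST: pick the optimal model
                idx[n_learn + n_test:])           # VALIDATE: confirm performance

    learn_idx, test_idx, val_idx = partition(10_000)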
  • 11. (Flow diagram: historical data is split into LEARN, TEST, and VALIDATE sets; a sequence of models is built on LEARN, performance is monitored on TEST, and findings are confirmed on VALIDATE)
  • 12. Analyst needs to indicate where the TEST data is to be found ◦ Stored in a separate file ◦ Selected at random from the available data ◦ Pre-selected from available data and marked by a special indicator  Other things to consider ◦ Population: LEARN and TEST sets come from different populations (within-sample versus out-of-sample) ◦ Time: LEARN and TEST sets come from different time periods (within-time versus out-of-time) ◦ Aggregation: logically grouped records must be all included or all excluded within each set (self-correlation)
  • 13. Any model is built on past data!  Fortunately, many models trace stable patterns of behavior  However, any model will eventually have to be rebuilt: ◦ Banks like to refresh risk models about every 12 months ◦ Targeted marketing models are typically refreshed every 3 months ◦ Ad web-server models may be refreshed every 24 hours  Credit risk score card expert Professor David Hand, University of London maintains: ◦ A predictive model is obsolete the day it is first deployed
  • 14.
  • 15. Model evaluation is at the core of the learning process (choosing the optimal model from a list of candidates)  Model evaluation is also a key part in comparing performance of different algorithms  Finally, model evaluation is needed to continuously monitor model performance over time  In predictive modeling (classification and regression) all we need is a sample of data with known outcome; different evaluation criteria can then be applied  There will never be “the best for all” model; the optimality is contingent upon current evaluation criterion and thus depends on the context in which the model is applied
  • 16. (insert graph)  One usually computes some measure of average discrepancy between the continuous model predictions f and the actual outcome y ◦ Least Squared Deviation: R = Σ(y-f)^2 ◦ Least Absolute Deviation: R = Σ|y-f|  Fancier definitions also exist ◦ Huber-M Loss: defined as a hybrid between the LS and LAD losses ◦ SVM Loss: ignores very small discrepancies and then switches to LAD-style  The raw loss value is often re-expressed in relative terms as R-squared
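To make the loss definitions above concrete, a minimal NumPy sketch (the Huber threshold delta and the toy numbers are illustrative assumptions):

    import numpy as np

    def ls_loss(y, f):
        """Least Squared Deviation: R = sum of (y - f)^2."""
        return np.sum((y - f) ** 2)

    def lad_loss(y, f):
        """Least Absolute Deviation: R = sum of |y - f|."""
        return np.sum(np.abs(y - f))

    def huber_loss(y, f, delta=1.0):
        """Hybrid loss: quadratic for small residuals, LAD-like beyond delta."""
        r = np.abs(y - f)
        return np.sum(np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta)))

    y = np.array([3.0, 1.5, 4.0, 2.0])
    f = np.array([2.5, 2.0, 3.0, 2.2])
    print(ls_loss(y, f), lad_loss(y, f), huber_loss(y, f))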
  • 17. There are three progressively more demanding approaches to solving binary classification problems  Division: a model makes the final class assignment for each observation internally ◦ Observations with identical class assignment are no longer discriminated ◦ A model needs to be rebuilt to change decision rules  Rank: a model assigns a continuous score to each observation ◦ The score on its own bears no direct interpretation ◦ But, higher class score means higher likelihood of class presence in general (without precise quantitative statements) ◦ Any monotone transformation of scores is admissible ◦ A spectrum of decision rules can be constructed strictly based on varying score threshold without model rebuilding  Probability: a model assigns a probability score to each observation ◦ Same as above, but the output is interpreted directly in the exact probabilistic terms
  • 18. Depending on the prediction emphasis, various performance evaluation criteria can be constructed for binary classification models  The following list, far from exhaustive, presents some of the frequently used evaluation criteria ◦ Accuracy (more generally, Expected Cost)  Applicable to all models ◦ ROC Curve and Area Under Curve  Not Applicable to Division Models ◦ Gains and Lift  Not Applicable to Division Models ◦ Log-likelihood (a.k.a. Cross-Entropy, Deviance)  Not Applicable to Division and Rank Models  The criteria above are listed in order from the least specific to the most specific  It is not guaranteed that all criteria will suggest the same model as optimal from a list of candidate models
  • 19. Most intuitive and also the weakest evaluation method that can be applied to any classification model  Each record must be assigned to a specific class  One first constructs a Prediction Success Table- a 2 by 2 matrix showing how many true 0s and 1s (rows) were classified by the model correctly or incorrectly (columns)  The classification accuracy is then the number of correct class assignments divided by the sample size  More general approaches will also include user supplied prior probabilities and cost matrix to compute the Expected Cost  The example below reports prediction success tables for two separate models along with the accuracy calculations  The method is not sensitive enough to emphasize larger class unbalance in model 1  (insert table)
  • 20. The classification accuracy approach assumes that each record has already been classified, which is not always convenient ◦ Algorithms producing a continuous score (Rank or Probability) will require a user-specified threshold to make final class assignments ◦ Different thresholds will result in different class assignments and likely different classification accuracies  The accuracy approach focuses on the separating boundary and ignores fine probability structure outside the boundary  Ideally, one needs an evaluator that works directly with the score itself and does not depend on external considerations like costs and thresholds  Also, for Rank models the evaluator needs to be invariant with respect to monotone transformations of the scores so that the “spirit” of such models is not violated
  • 21. The following approach takes full advantage of the set of continuous scores produced by Rank or Probability models  Pick one of the two target classes as the class in focus  Sort the database by predicted score in descending order  Choose a set of different score values ◦ Could be ALL of the unique scores produced by the model ◦ More often a set of scores obtained by binning sorted records into equal-size bins  For any fixed value of the score we can now compute: ◦ Sensitivity (a.k.a. True Positive Rate): percent of the class in focus with predicted scores above the threshold ◦ Specificity: percent of the opposite class with predicted scores below the threshold (1-specificity is the False Positive Rate)  We then display the results as a plot of [sensitivity] versus [1-specificity]  The resulting curve is known as the ROC Curve
  • 22. (insert graph)  ROC Curves for three different rank models are shown  No model can be considered the absolute best at all times  The optimal model selection will rest with the user  Average overall performance can be measured as Area Under the ROC Curve (AUC) ◦ ROC Curve (up to orientation) and AUC are invariant with respect to the focus class selection ◦ The best attainable AUC is always 1.0 ◦ AUC of a model with randomly assigned scores is 0.5  AUC can be interpreted ◦ Suppose we repeatedly pick one observation at random from the focus class and another from the opposite class ◦ Then AUC is the fraction of trials in which the focus class observation has a greater predicted score than the opposite class observation ◦ AUC below 0.5 means that something is fundamentally wrong
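A small sketch of how the ROC points and AUC can be computed from scores (function names and toy data are illustrative assumptions); the AUC function uses the pairwise interpretation described on the slide:

    import numpy as np

    def roc_points(y, score):
        """Sensitivity vs. 1-specificity as the score threshold is lowered."""
        order = np.argsort(-score)                  # sort records by descending score
        y_sorted = np.asarray(y)[order]
        tp = np.cumsum(y_sorted)                    # focus-class cases above each cutoff
        fp = np.cumsum(1 - y_sorted)                # opposite-class cases above each cutoff
        return fp / (len(y_sorted) - y_sorted.sum()), tp / y_sorted.sum()

    def auc(y, score):
        """Fraction of (focus, opposite) pairs where the focus case scores higher."""
        y, score = np.asarray(y), np.asarray(score)
        pos, neg = score[y == 1], score[y == 0]
        wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
        return wins / (len(pos) * len(neg))

    y     = np.array([1, 0, 1, 1, 0, 0, 1, 0])
    score = np.array([.9, .8, .7, .6, .55, .4, .3, .1])
    print(auc(y, score))                            # 0.6875 for this toy example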
  • 23. The following example justifies another slightly different approach to model evaluation  Suppose we want to mail a certain offer to a fraction P of the population  Mailing to a randomly chosen sample will capture about a fraction P of the responders (random sampling)  Now suppose that we have access to a response model which ranks each potential responder by a score  If we sample the fraction P of the population targeting members with the highest predicted scores first (model-guided sampling), we get a fraction T of the responders, which we expect to be higher than P  The lift at the Pth percentile is defined as the ratio T/P  Obviously, meaningful models will always produce lift greater than 1  The process can be repeated for all possible percentiles and the results summarized graphically as Gains and Cumulative Lift curves  In practice, one usually first sorts observations by score and then partitions the sorted data into a fixed number of bins to save on calculations, just as is usually done for ROC curves
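A minimal sketch of the cumulative lift calculation described above (the simulated response data and the function name are illustrative assumptions):

    import numpy as np

    def cumulative_lift(y, score, fraction):
        """Lift at a population cutoff: share of responders captured by targeting
        the top `fraction` of records by score, relative to random sampling."""
        y, score = np.asarray(y), np.asarray(score)
        n_top = max(1, int(round(fraction * len(y))))
        captured = y[np.argsort(-score)][:n_top].sum() / y.sum()   # gains T at cutoff P
        return captured / fraction                                 # lift = T / P

    rng = np.random.default_rng(0)
    score = rng.random(1000)
    y = (rng.random(1000) < 0.05 + 0.10 * score).astype(int)       # response tied to score
    print(cumulative_lift(y, score, 0.10))                         # top decile, lift > 1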
  • 24. (insert graphs and tables)
  • 25. (insert graphs)  Lift at a given percentile provides a point measure of performance for the given population cutoff ◦ Can be viewed as the relative length of the vertical line segment connecting the gains curve at the given population cutoff  Area Under the Gains curve (AUG): provides an integral measure of performance across all bins ◦ Unlike AUC, the largest attainable value of AUG is (1 - P/2), where P is the fraction of responders in the population  Just like ROC curves, gains and lift curves for different models can intersect, so that performance-wise one model is better for one range of cutoffs while another model is better for a different range  Unlike the ROC curve, gains and lift curves do depend on the class in focus ◦ For the dominant class, gains and lift curves degenerate to the trivial 45-degree line (the random case)
  • 26. ROC, Gains, and lift curves together with AUC and AUG are invariant with respect to monotone transformation of the model scores ◦ Scores are only used to sort records in the evaluation set, the actual score values are of no consequence  All these measures address the same conceptual phenomenon emphasizing different sides and thus can be easily derived from each other ◦ Any point (P,G) on a gains curve corresponds to the point (P,G/P) on the lift curve ◦ Suppose that the focus class occupies fraction F of the population; then any point (P,G) on a gains curve corresponds to the point {(P-FG)/(1-F),G} on the ROC curve  It follows that the ROC graph “pushes” the gains graph “away” from the 45 degree line  Dominant focus class (large F) is “pushed” harder so that the degeneracy of its gain curve disappears  In contrast, rare focus class (small F) has ROC curve naturally “close” to the gains curve  All of these measures are widely used as robust performance evaluations in various practical applications
  • 27. When the output score can be interpreted as a probability, a more specific evaluation criterion can be constructed to assess the probabilistic accuracy of the model  We assume that the model generates P(X), the conditional probability of 1 given X  We also assume that the binary target Y is coded as -1 and +1 (only for notational convenience)  The Cross-Entropy (CXE) criterion is then computed as (insert equation) ◦ The inner Log computes the log-odds of Y=1 ◦ The value itself is the negative log-likelihood assuming independence of responses ◦ Alternative notation assumes 0/1 target coding and uses the following formula (insert equation) ◦ The values produced by either formula are identical  The model with the smallest CXE has the largest likelihood and is thus considered the best at capturing the right probability structure
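Since the slide's equations are not reproduced here, the following sketch shows one standard form consistent with the description (y coded -1/+1 with half the log-odds inside the exponent) and checks numerically that it matches the 0/1-coded negative log-likelihood; function names and toy numbers are assumptions:

    import numpy as np

    def cxe_pm1(y_pm1, p):
        """CXE with y in {-1, +1}; 0.5*log(p/(1-p)) is half the log-odds of Y=1."""
        f = 0.5 * np.log(p / (1 - p))
        return np.sum(np.log(1 + np.exp(-2 * y_pm1 * f)))

    def cxe_01(y01, p):
        """Negative log-likelihood with y coded 0/1."""
        return -np.sum(y01 * np.log(p) + (1 - y01) * np.log(1 - p))

    p = np.array([0.9, 0.2, 0.7, 0.4])
    y01 = np.array([1, 0, 1, 1])
    print(cxe_pm1(2 * y01 - 1, p), cxe_01(y01, p))   # identical values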
  • 28. The example shows the true non-monotonic conditional probability (dark blue curve)  We generated 5,000 LEARN and TEST observations based on this probability model  We report predicted responses generated by different modeling approaches ◦ Red: best-accuracy MART model ◦ Yellow: best-CXE MART model ◦ Cyan: univariate LOGIT model  Performance-wise ◦ All models have identical accuracy but the best-accuracy model is substantially worse in terms of CXE ◦ LOGIT can't capture the departure from monotonicity, as reported by CXE
  • 29.
  • 30. MARS is a highly-automated tool for regression  Developed by Jerome H. Friedman of Stanford University ◦ Annals of statistics, 1991 dense 65 page article ◦ Takes some inspiration from its ancestor CART® ◦ Produces smooth curves and surfaces, not the step-functions of CART  Appropriate target variables are continuous  End result of a MARS run is a regression model ◦ MARS automatically chooses which variables to use ◦ Variables are optimally transformed ◦ Interactions are detected ◦ Model is self-tested to protect against over-fitting  Can also perform well on binary dependent variables ◦ Censored survival model (waiting time models as in churn)
  • 31. Harrison, D. and D. Rubinfeld. Hedonic Housing Prices and the Demand for Clean Air. Journal of Environmental Economics and Management v5, 81-102, 1978  506 census tracts in the city of Boston for the year 1970  Goal: study the relationship between quality-of-life variables and property values ◦ MV- median value of owner-occupied homes in tract ('000s) ◦ CRIM- per capita crime rate ◦ NOX- concentration of nitrogen oxides (pphm) ◦ AGE- percent built before 1940 ◦ DIS- weighted distance to centers of employment ◦ RM- average number of rooms per house ◦ LSTAT- percent of neighborhood of 'lower socio-economic status' ◦ RAD- accessibility to radial highways ◦ CHAS- borders Charles River (0/1) ◦ INDUS- percent non-retail business ◦ TAX- tax rate ◦ PT- pupil-teacher ratio
  • 32. (insert graph)  The dataset poses significant challenges to conventional regression modeling ◦ Clear departures from normality, non-linear relationships, and skewed distributions ◦ Multicollinearity, mutual dependency, and outlying observations
  • 33. (insert graph)  A typical MARS solution (univariate for simplicity) is shown above ◦ Essentially a piece-wise linear regression model with the continuity requirement at the transition points called knots ◦ The locations and number of knots were determined automatically to ensure the best possible model fit ◦ The solution can be analytically expressed as conventional regression equations
  • 34. Finding the one best knot in a simple regression is a straightforward search problem ◦ Try a large number of potential knots and choose the one with the best R-squared ◦ Computation can be implemented efficiently using update algorithms; the entire regression does not have to be rerun for every possible knot (just update the X'X matrices)  Finding k knots simultaneously would require on the order of N^k computations for N observations  To preserve linear problem complexity, multiple knot placement is implemented in a step-wise manner: ◦ Need a forward/backward procedure ◦ The forward procedure adds knots sequentially one at a time  The resulting model will have many knots and overfit the training data ◦ The backward procedure removes the least contributing knots one at a time  This produces a list of models of varying complexity ◦ Using an appropriate evaluation criterion, identify the optimal model  The resulting model will have approximately correct knot locations
  • 35. (insert graphs)  True conditional mean has two knots at X=30 and X=60, observed data includes additional random error  Best single knot will be at X=45, subsequent best locations are true knots around 30 and 60  The backward elimination step is needed to remove the redundant node at X=45
  • 36. Thinking in terms of knot selection works very well to illustrate splines in one dimension but is unwieldy when working with a large number of variables simultaneously ◦ Need a concise notation that is easy to program and extend to multiple dimensions ◦ Need to support interactions, categorical variables, and missing values  Basis functions (BF) provide the analytical machinery to express the knot placement strategy  A basis function is a continuous univariate transform that reduces a predictor's influence to a smaller range of values controlled by a parameter c (20 in the example below) ◦ Direct BF: max(X-c, 0)- the original range is cut below c ◦ Mirror BF: max(c-X, 0)- the original range is cut above c ◦ (insert graphs)
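A short sketch of the two basis-function types, and of the point made later that a MARS-style model is just a linear regression in the basis-function space (the knot location, coefficients, and data are illustrative assumptions):

    import numpy as np

    def direct_bf(x, c):
        """Direct basis function max(x - c, 0): zero below the knot c."""
        return np.maximum(x - c, 0.0)

    def mirror_bf(x, c):
        """Mirror basis function max(c - x, 0): zero above the knot c."""
        return np.maximum(c - x, 0.0)

    rng = np.random.default_rng(7)
    x = np.linspace(0, 40, 200)
    y = 3.0 - 0.2 * x + 0.5 * direct_bf(x, 20) + rng.normal(0, 0.1, x.size)

    # a one-knot MARS-style fit is ordinary least squares on the basis functions
    X = np.column_stack([np.ones_like(x), x, direct_bf(x, 20)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(coef)   # roughly [3.0, -0.2, 0.5]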
  • 37. The following model represents a 3-knot univariate solution for the Boston Housing Dataset using two direct and one mirror basis functions  (insert equations)  All three line segments have negative slope even though two coefficients are above zero  (insert graph)
  • 38. MARS core technology: ◦ Forward step: add basis function pairs one at a time in conventional step- wise forward manner until the largest model size (specified by the user) is reached  Possible collinearity due to redundancy in pairs must be detected and eliminated  For categorical predictors define basis functions as indicator variables for all possible subsets of levels  To support interactions, allow cross products between a new candidate pair and basis functions already present in the model ◦ Backward step: remove basis functions one at a time in conventional step- wise backward manner to obtain a sequence of candidate models ◦ Use test sample or cross-validation to identify the optimal model size  Missing values are treated by constructing missing value indicator (MVI) variables and nesting the basis functions within the corresponding MVIs  Fast update formulae and smart computational shortcuts exist to make the MARS process as fast and efficient as possible
  • 39. OLS and MARS regression (insert graphs)  We compare the results of classical linear regression and MARS ◦ Top three significant predictors are shown for each model ◦ Linear regression provides global insights ◦ MARS regression provides local insights and has superior accuracy  All cut points were automatically discovered by MARS  MARS model can be presented as a linear regression model in the BF space
  • 40.
  • 41. One of the oldest Data Mining tools for classification  The method was originally developed by Fix and Hodges (1951) in an unpublished technical report  Later on it was reproduced by Agrawala (1977), Silverman and Jones (1989)  A review book with many references on the topic is Dasarathy (1991)  Other books that treat the issue: ◦ Ripley B.D. 1996. Pattern Recognition and Neural Networks (chapter 6) ◦ Hastie T, Tibshirani R and Friedman J. 2001. The Elements of Statistical Learning Data Mining, Inference and Prediction (chapter 13)  The underlying idea is quite simple: make the predictions by proximity or similarity  Example: we are interested in predicting if a customer will respond to an offer. A NN classifier will do the following: ◦ Identify a set of people most similar to the customer- the nearest neighbor ◦ Observe what they have done in the past on a similar offer ◦ Classify by majority voting: if most of them are responders, predict a responder, otherwise, predict a non-responder
  • 42. (insert graphs)  Consider binary classification problem  Want to classify the new case highlighted in yellow  The circle contains the nearest neighbors (the most similar cases) ◦ Number of neighbors= 16 ◦ Votes for blue class= 13 ◦ Votes for red class= 3  Classify the new case in the blue class. The estimated probability of belonging to the blue class is 13/16=0.8125  Similarly in this example: ◦ Classify the yellow instance in the blue class ◦ Classify the green instance in the red class ◦ The black point receives three votes from the blue class and another three from the red one- the resulting classification is indeterminate
  • 43. There are two decisions that should be made in advance before applying the NN classifier ◦ The shape of the neighborhood  Answers the question “Who are our nearest neighbors?” ◦ The number of neighbors (neighborhood size)  Answers the question “How many neighbors do we want to consider?”  Neighborhood shape amounts to choosing the proximity/distance measure ◦ Manhattan distance ◦ Euclidean distance ◦ Infinity distance ◦ Adaptive distances  Neighborhood size K can vary between 1 and N (the dataset size) ◦ K=1-classification is based on the closest case in the dataset ◦ K=N-classification is always to the majority class ◦ Thus K acts as a smoothing parameter and can be determined by using a test sample or cross-validation
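A minimal sketch of the voting scheme with a Euclidean neighborhood and K = 16, echoing the earlier example (the data and function name are illustrative assumptions):

    import numpy as np

    def knn_predict(X_train, y_train, x_new, k=16):
        """Majority vote among the k nearest neighbors (Euclidean distance);
        the vote fraction serves as a crude class-1 probability estimate."""
        d = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
        votes = y_train[np.argsort(d)[:k]]
        p_one = votes.mean()
        return int(p_one >= 0.5), p_one

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    print(knn_predict(X, y, np.array([0.5, 0.5]), k=16))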
  • 44. NN advantages ◦ Simple to understand and easy to implement ◦ The underlying idea is appealing and makes logical sense ◦ Available for both classification and regression problems  Predictions are determined by averaging the values of the nearest neighbors ◦ Can produce surprisingly accurate results in a number of applications  NN has been shown to perform as well as or better than LDA, CART, Neural Networks, and other approaches when applied to remote-sensing data  NN disadvantages ◦ Unlike decision trees, LDA, or logistic regression, their decision boundaries are not easy to describe and interpret ◦ No variable selection of any kind- vulnerable to noisy inputs  All variables have the same weight when computing the distance, so two cases could be considered similar (or dissimilar) due to the role of irrelevant features (masking effects) ◦ Subject to the curse of dimensionality in high-dimensional datasets ◦ The technique is quite time consuming. However, Friedman et al. (1975, 1977) have proposed fast algorithms
  • 45.
  • 46. Classification and Regression Trees (CART®)- the original approach based on the “let the data decide local regions” concept developed by Breiman, Friedman, Olshen, and Stone in 1984  The algorithm can be summarized as: ◦ For each current data region, consider all possible orthogonal splits (based on one variable) into 2 sub-regions ◦ The best split is defined as the one having the smallest MSE after fitting a constant in each sub-region (regression) or the smallest resulting class impurity (classification) ◦ Proceed recursively until all structure in the training set has been completely exhausted- the largest tree is produced ◦ Create a sequence of nested sub-trees with different amounts of localization (tree pruning) ◦ Pick the best tree based on performance on a test set or via cross-validation  One can view a CART tree as a set of dynamically constructed orthogonal nearest-neighbor boxes of varying sizes guided by the response variable (homogeneity of response within each box)
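As an illustration of the split search at the heart of the algorithm, a sketch of the regression case for a single variable (not the CART implementation; the data and function name are assumptions):

    import numpy as np

    def best_split(x, y):
        """Exhaustively search one variable for the orthogonal split that minimizes
        the summed squared error of a constant fit in each sub-region."""
        best_c, best_sse = None, np.inf
        for c in np.unique(x)[:-1]:                     # every candidate split point
            left, right = y[x <= c], y[x > c]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best_sse:
                best_c, best_sse = c, sse
        return best_c, best_sse

    rng = np.random.default_rng(5)
    x = rng.uniform(0, 10, 300)
    y = np.where(x < 4, 1.0, 5.0) + rng.normal(0, 0.5, 300)
    print(best_split(x, y))                             # split point lands near x = 4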
  • 47. CART is best illustrated with a famous example- the UCSD Heart Disease study ◦ Given the diagnosis of a heart attack based on  Chest pain, Indicative EKGs, Elevation of enzymes typically released by damaged heart muscle, etc. ◦ Predict who is at risk of a 2nd heart attack and early death within 30 days ◦ Prediction will determine treatment program (intensive care or not)  For each patient about 100 variables were available, including: ◦ Demographics, medical history, lab results ◦ 19 noninvasive variables were used in the analysis  Age, gender, blood pressure, heart rate, etc.  CART discovered a very useful model utilizing only 3 final variables
  • 48. (insert classification tree)  Example of a CLASSIFICATION tree  Dependent variable is categorical (SURVIVE, DIE)  The model structure is inherently hierarchical and cannot be represented by an equivalent logistic regression equation  Each terminal node describes a segment in the population  All internal splits are binary  Rules can be extracted to describe each terminal node  Terminal node class assignment is determined by the distribution of the target in the node itself  The tree effectively compresses the decision logic
  • 49. CART advantages: ◦ One of the fastest data mining algorithms available ◦ Requires minimal supervision and produces easy to understand models ◦ Focuses on finding interactions and signal discontinuities ◦ Important variables are automatically identified ◦ Handles missing values via surrogate splits  A surrogate split is an alternative decision rule supporting the main rule by exploiting local rank-correlation in a node ◦ Invariant to monotone transformations of predictors  CART disadvantages: ◦ Model structure is fundamentally different from conventional modeling paradigms- may confuse reviewers and classical modelers ◦ Has limited number of positions to accommodate available predictors- ineffective at presenting global linear structure (but great for interactions) ◦ Produces coarse-grained piece-wise constant response surfaces
  • 50. (insert charts)  10-node CART tree was built on the cell phone dataset introduced earlier  The root Node 1 displays details of TARGET variable in the training data ◦ 15.2% of the 830 households accepted the marketing offer  CART tried all variable predictors one at a time and found out that partitioning the set of subjects based on the Handset Price variable is most effective at separating responders from non- responders at this point ◦ Those offered the phone with a price>130 contain only 9.9% responders ◦ Those offered a lower price<130 respond at 21.9%  The process of splitting continues recursively until the largest tree is grown  Subsequent tree pruning eliminates least important branches and creates a sequence of nested trees- candidate models
  • 51. (insert charts)  The red nodes indicate good responders while the blue nodes indicate poor responders  Observations with high values on a split variable always go right while those with low values go left  Terminal nodes are numbered left to right and provide the following useful insights ◦ Node 1: young prospects having very small phone bill, living in specific cities are likely to respond to an offer with a cheap handset ◦ Node 5: mature prospects having small phone bill, living in specific cities (opposite Node1) are likely to respond to an offer with a cheap handset ◦ Nodes 6 and 8: prospects with large phone bill are likely to respond as long as the handset is cheap ◦ Node 10: “high-tech” prospects (having a pager) with large phone bill are likely to respond to even offers with expensive handset
  • 52. (insert graph, table and chart)  A number of variables were identified as important ◦ Note the presence of surrogates not seen on the main tree diagram previously  Prediction Success table reports classification accuracy on the test sample  Top decile (10% of the population with the highest scores) captures 40% of the responders (lift of 4)
  • 53. (insert graphs)  CART has a powerful mechanism of priors built into the core of the tree building mechanism  Here we report the results of an experiment with prior on responders varying from 0.05 to 0.95 in increments of 0.05  The resulting CART models “sweep” the modeling space enforcing different sensitivity-specificity tradeoff
  • 54. As prior on the given class decreases  The class assignment threshold increases  Node richness goes up  But class accuracy goes down  PRIORS EQUAL uses the root node class ratio as the class assignment threshold- hence, most favorable conditions to build a tree  PRIORS DATA uses the majority rule as the class assignment threshold- hence, difficult modeling conditions on unbalanced classes.  In reality, a proper combination of priors can be found experimentally  Eventually, when priors are too extreme, CART will refuse to build a tree. ◦ Often the hottest spot is a single node in the tree built with the most extreme priors with which CART will still build a tree. ◦ Comparing hotspots in successive trees can be informative, particularly in moderately-sized data sets.
  • 55. (insert graph)  We have a mixture of two overlapping classes  The vertical lines show root node splits for different sets of priors. (the left child is classified as red, the right child is classified as blue)  Varying priors provides effective control over the tradeoff between class purity and class accuracy
  • 56. Hot spots are areas of data very rich in the event of interest, even though they could only cover a small fraction of the targeted group ◦ A set of prospects rich in responders ◦ A set of transactions with abnormal amount of fraud  The varying-priors collection of runs introduced above gives perfect raw material in the search of hot spots ◦ Simply look at all terminal nodes across all trees in the collection and identify the highest response segments ◦ Also want to have such segments as large as possible ◦ Once identified, the rules leading to such segments (nodes) are easily available ◦ (insert graph) ◦ The graph on the left reports all nodes according to their target coverage and lift ◦ The blue curve connects the nodes most likely to be a hot spot
  • 57. (insert graph)  Our next experiment (variable shaving) runs as follows: ◦ Build a CART model with the full set of predictors ◦ Check the variable importance, remove the least important variable and rebuild CART model ◦ Repeat previous step until all variables have been removed  Six-variable model has the best performance so far  Alternative shaving techniques include: ◦ Proceed by removing the most important variable- useful in removal of model “hijackers”- variables looking very strong on the train data but failing on the test data (e.g. ID variables) ◦ Set up nested looping to remove redundant variables from the inner positions on the variable importance list
  • 58. (insert tree)  Many predictive models benefit from Salford Systems patent on “Structured Trees”  Trees constrained in how they are grown to reflect decision support requirements ◦ Variables allowed/disallowed depending on a level in a tree ◦ Variable allowed/disallowed depending on a node size  In mobile phone example: want tree to first segment on customer characteristics and then complete using price variables ◦ Price variables are under the control of the company ◦ Customer characteristics are beyond company control
  • 59. Various areas of research were spawned by CART  We report on some of the most interesting and well developed approaches  Hybrid models ◦ Combining CART with linear and Logistic Regression ◦ Combining CART with Neural Nets  Linear combination splits  Committees of trees ◦ Bagging ◦ Arcing ◦ Random Forest  Stochastic Gradient Boosting (MART a.k.a TreeNet)  Rule Fit and Path Finder
  • 60.
  • 61. (insert images)  Grow a tree on training data  Find a way to grow another tree, different from those currently available (change something in the set-up)  Repeat many times, say 500 replications  Average the results or create a voting scheme ◦ For example, relate PD to the fraction of trees predicting default for a given case  The beauty of the method is that every new tree starts with a complete set of data  Any one tree can run out of data, but when that happens we just start again with a new tree and all the data (before sampling)
  • 62. Have a training set of size N  Create a new data set of size N by sampling with replacement from the training set  The new set (called a bootstrap sample) will differ from the original; on average (for large N): ◦ about 36.8% of the original records are excluded ◦ about 36.8% of the original records are included once ◦ about 18.4% of the original records are included twice ◦ about 6.1% of the original records are included three times ◦ about 1.9% of the original records are included four or more times  May do this repeatedly to generate numerous bootstrap samples  Example: distribution of record weights in one realized bootstrap sample  (insert table)
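A quick simulation confirming these expected fractions (the sample size and seed are arbitrary):

    import numpy as np
    from collections import Counter

    rng = np.random.default_rng(11)
    n = 100_000
    sample = rng.integers(0, n, size=n)              # draw N of N records with replacement
    times_included = Counter(Counter(sample).values())

    print("excluded:", 1 - len(set(sample)) / n)     # close to exp(-1) = 0.368
    for k in sorted(times_included):
        print(f"included {k} time(s):", times_included[k] / n)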
  • 63. To generate predicted response, multiple trees are combined via voting (classification) or averaging (regression) schemas  Classification trees “vote” ◦ Recall that classification trees classify  Assign each case to ONE class only ◦ With 100 trees, 100 separate class assignment (votes) for each record ◦ Winner is the class with the most votes ◦ Fraction of votes can be used as a crude approximation to class probability ◦ Votes could be weighted- say by accuracy of individual trees or node sizes ◦ Class weights can be introduced to counter the effects of dominant classes  Regression trees assign a real predicted value for each case ◦ Predictions are combined via averaging ◦ Results will be much smoother than from a single tree
  • 64. Breiman reports the results of running bootstrap aggregation (bagger) on four publicly available datasets from Statlog project  In all cases the bagger shows substantial improvement in the classification accuracy  It all comes at a price of no longer having a single interpretable model, substantially longer run time and greater demand on model storage space  (insert tables)
  • 65. Bagging proceeds by independent, identically-distributed sampling draws  Adaptive resampling: the probability that a case is sampled varies dynamically ◦ Cases with higher current prediction errors have a greater probability of being sampled in the next round ◦ The idea is to focus on the cases most difficult to predict correctly  A similar procedure was first introduced by Freund & Schapire (1996)  The Breiman variant (ARC-x4) is easier to understand: ◦ Suppose we have already grown K trees: let m = # times case i was misclassified (0 ≤ m ≤ K) (insert equations) ◦ Weight = 1 for cases with zero misclassifications ◦ Weight = 1 + K^4 for cases with K misclassifications  The weight rapidly becomes large if a case is difficult to classify
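A sketch of the resulting resampling probabilities, using the general 1 + m^4 weighting implied by the two endpoints above (the misclassification counts in the example are made up):

    import numpy as np

    def arc_x4_probs(m):
        """Resampling probabilities after K trees: weight each case by 1 + m^4,
        where m is how many of the K trees misclassified it, then normalize."""
        w = 1.0 + np.asarray(m, dtype=float) ** 4
        return w / w.sum()

    m = np.array([0, 0, 1, 2, 0, 8, 10, 1])   # misclassification counts after 10 trees
    print(arc_x4_probs(m).round(4))           # the hard cases dominate the next sample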
  • 66. The results of running bagger and ARCer on the Boston Housing Data are reported below  Bagger shows substantial improvement over the single- tree model  ARCer shows marginal improvement over the bagger  (insert table)  Single tree now performs worse than stand alone CART run (R-squared=72%) because in bagging we always work with exploratory trees only  Arcing performance beats MARS additive model but is still inferior to the MARS interactions model
  • 67. Boosting (and Bagging) are very slow and consume a lot of memory, the final models tend to be awkwardly large and unwieldy  Boosting in general is vulnerable to overtraining ◦ Much better fit on training than on test data ◦ Tendency to perform poorly on future data ◦ Important to employ additional considerations to reduce overfitting  Boosting is also highly vulnerable to errors in the data ◦ Technique designed to obsess over errors ◦ Will keep trying to “learn” patterns to predict miscoded data ◦ Ideally would like to be able to identify miscoded and outlying data and exclude those records from the learning process ◦ Documented in study by Dietterich (1998)  An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees, Bagging, Boosting, and Randomization
  • 68.
  • 69. New approach for many data analytical tasks developed by Leo Breiman of University of California, Berkeley ◦ Co-author of CART® with Friedman, Olshen, and Stone ◦ Author of Bagging and Arcing approaches to combining trees  Good for classification and regression problems ◦ Also for clustering, density estimation ◦ Outlier and anomaly detection ◦ Explicit missing value imputation  Builds on the notions of committees of experts but is substantially different in key implementation details
  • 70. A random forest is a collection of single trees grown in a special way ◦ Each tree is grown on a bootstrap sample from the learning set ◦ A number R is specified (the square root of the number of predictors by default) such that it is noticeably smaller than the total number of available predictors ◦ During the tree growing phase, at each node only R predictors are randomly selected and tried  The overall prediction is determined by voting (in classification) or averaging (in regression)  The Law of Large Numbers ensures convergence  The key to accuracy is low correlation and bias  To keep bias low, trees are grown to maximum depth
• 71. Randomness is introduced in two distinct ways  Each tree is grown on a bootstrap sample from the learning set ◦ The default bootstrap sample size equals the original sample size ◦ Smaller bootstrap sample sizes are sometimes useful  A number R is specified (square root of the number of predictors by default) such that it is noticeably smaller than the total number of available predictors  During the tree-growing phase, at each node only R randomly selected predictors are tried  Randomness also reduces the signal-to-noise ratio of any single tree ◦ A low correlation between trees is more important than a high signal when many trees contribute to forming the model ◦ RandomForests™ trees often have very low signal strength individually, even when the signal strength of the forest is high
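As a rough open-source analogue, scikit-learn's random forest exposes the same two sources of randomness; a minimal sketch (the dataset and settings are illustrative only, not the Salford implementation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,     # many trees; convergence relies on the Law of Large Numbers
    bootstrap=True,       # each tree grown on its own bootstrap sample
    max_features="sqrt",  # R = sqrt(#predictors) candidate splitters tried at each node
    max_depth=None,       # trees grown to maximum depth to keep bias low
    random_state=0,
).fit(X, y)
```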
  • 72. (insert graph)  Gold- Average of 50 Base Learners  Blue- Average of 100 Base Learners  Red- Average of 500 Base Learners
  • 73. (insert graph)  Averaging many base learners improves the signal to noise ratio dramatically provided that the correlation of errors is kept low  Hundreds of base learners are needed for the most noticeable effect
• 74. All major advantages of a single tree are automatically preserved  Since each tree is grown on a bootstrap sample, one can ◦ Use out-of-bag samples to compute an unbiased estimate of the accuracy ◦ Use out-of-bag samples to determine variable importance  There is no overfitting as the number of trees increases  It is possible to compute a generalized proximity between any pair of cases  Based on proximities one can ◦ Proceed with a target-driven clustering solution ◦ Detect outliers ◦ Generate informative data views/projections using scaling coordinates ◦ Do missing value imputation  There are interesting approaches to extending the methodology to survival models and the unsupervised learning domain
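For example, in the scikit-learn analogue the out-of-bag accuracy estimate and the variable importance ranking come essentially for free (again a sketch, not the Salford implementation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500, max_features="sqrt", oob_score=True, random_state=0
).fit(X, y)

print(rf.oob_score_)            # accuracy estimated from out-of-bag cases only
print(rf.feature_importances_)  # relative variable importance ranking
```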
• 75. RF introduces a novel way to define proximity between two observations: ◦ For a dataset of size N define an N×N matrix of proximities ◦ Initialize all proximities to zero ◦ For any given tree, apply the tree to the dataset ◦ If case i and case j both end up in the same node, increase the proximity prox(i,j) between i and j by one ◦ Accumulate over all trees in the RF and normalize by twice the number of trees  The resulting matrix provides an intrinsic measure of proximity ◦ Observations that are “alike” will have proximities close to one ◦ The closer the proximity to 0, the more dissimilar cases i and j are ◦ The measure is invariant to monotone transformations ◦ The measure is well defined for any type of independent variable, including categorical
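A simplified sketch of the proximity computation: drop every case down every tree, count how often a pair of cases shares a terminal node, and normalize. Here the normalization is simply by the number of trees, since each tree is applied once to all cases; variants (including the factor-of-two normalization above and out-of-bag-only accumulation) differ in this step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

leaves = rf.apply(X)          # terminal-node index of each case in each tree, shape (N, trees)
N, n_trees = leaves.shape
prox = np.zeros((N, N))
for t in range(n_trees):
    same_node = leaves[:, t][:, None] == leaves[:, t][None, :]
    prox += same_node         # +1 whenever cases i and j land in the same node of tree t
prox /= n_trees               # "alike" cases end up with proximity close to 1
```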
  • 76.
• 77. TreeNet (TN) is a new approach to machine learning and function approximation developed by Jerome H. Friedman at Stanford University ◦ Co-author of CART® with Breiman, Olshen, and Stone ◦ Author of MARS®, PRIM, Projection Pursuit, COSA, RuleFit™ and more  Also known as Stochastic Gradient Boosting and MART (Multiple Additive Regression Trees)  Naturally supports the following classes of predictive models ◦ Regression (continuous target; LS and LAD loss functions) ◦ Binary classification (binary target; logistic likelihood loss function) ◦ Multinomial classification (multiclass target; multinomial likelihood loss function) ◦ Poisson regression (count target; Poisson likelihood loss function) ◦ Exponential survival (positive target with censoring) ◦ Cox proportional hazards survival model  TN builds on the notions of committees of experts and boosting but is substantially different in key implementation details
  • 78. We focus on TreeNet because:  It is the method introduced in the original Stochastic Gradient Boosting article  It is the method used in many successful real world studies  We have found it to be more accurate than the other methods ◦ Many decisions that affect many people are made using a TreeNet model ◦ Major new fraud detection engine uses TreeNet ◦ David Cossock of Yahoo recently published a paper on uses of TreeNet in web search  TreeNet is a fully developed methodology. New capabilities include: ◦ Graphical display of the impact of any predictor ◦ New automated ways to test for existence of interactions ◦ New ways to identify and rank interactions ◦ Ability to constrain model: allow some interactions and disallow others. ◦ Method to recast TreeNet model as a logistic regression.
• 79. Built on CART trees and thus ◦ Immune to outliers ◦ Selects variables ◦ Results are invariant under monotone transformations of variables ◦ Handles missing values automatically  Resistant to mislabeled target data ◦ In medicine, cases are commonly misdiagnosed ◦ In business, non-responders are occasionally flagged as “responders”  Resistant to overtraining and generalizes very well  Can be remarkably accurate with little effort  Trains very rapidly, comparable to CART
• 80. 2007 PAKDD competition (home loan up-sell to credit card owners): 2nd place ◦ Model built in half a day using the previous year's submission as a blueprint  2006 PAKDD competition (customer type discrimination): 3rd place ◦ Model built in one day; 1st place accuracy 81.9%, TreeNet accuracy 81.2%  2005 BI-CUP sponsored by the University of Chile attracted 60 competitors  2004 KDDCup “Most Accurate”  2003 Duke University/NCR Teradata churn modeling competition ◦ Most Accurate and Best Top Decile Lift on both in-time and out-of-time samples  A major financial services company has tested TreeNet across a broad range of targeted marketing and risk models for the past two years ◦ TreeNet consistently outperforms previous best models (by around 10% in AUROC) ◦ TreeNet models can be built in a fraction of the time previously required ◦ TreeNet reveals previously undetected predictive power in data
• 81. Begin with one very small tree as the initial model ◦ Could be as small as ONE split generating 2 terminal nodes ◦ A typical model will have 3-5 splits per tree, generating 4-6 terminal nodes ◦ The output is a continuous response surface regardless of the target type  Hence probability-style predictions for classification ◦ The model is intentionally “weak”: shrink all model predictions toward zero by multiplying them by a small positive learn rate  Compute “residuals” (prediction errors) for this simple model for every record in the data ◦ The exact definition of the residual is driven by the type of loss function  Grow a second small tree to predict the residuals from the first tree  Continue adding more and more trees until a reasonable number have been added (see the sketch below) ◦ It is important to monitor accuracy on an independent test sample
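A minimal least-squares sketch of this residual-fitting loop; it omits TreeNet's subsampling, outlier handling, and likelihood-based loss functions, and the settings are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=1.0, random_state=0)

learn_rate = 0.01          # shrink every tree's contribution toward zero
pred = np.zeros(len(y))    # start from a trivial model
ensemble = []

for _ in range(500):
    residual = y - pred                             # prediction error under squared-error loss
    tree = DecisionTreeRegressor(max_leaf_nodes=6)  # one small tree per cycle
    tree.fit(X, residual)
    pred += learn_rate * tree.predict(X)            # add a damped correction
    ensemble.append(tree)
# in practice, monitor accuracy on an independent test sample to choose when to stop
```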
  • 82. (insert chart)
• 83. Trees are kept small (2-6 nodes common)  Updates are small: the learn rate can be as small as 0.01, 0.001, or 0.0001  Use random subsets of the training data in each cycle ◦ Never train on all the training data in any one cycle  Highly problematic cases are IGNORED ◦ If the model prediction starts to diverge substantially from the observed data, that record will not be used in further updates  TN allows very flexible control over interactions: ◦ Strictly additive models (no interactions allowed) ◦ Low-level interactions allowed ◦ High-level interactions allowed ◦ Constraints: only specific interactions allowed (TN PRO)
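These controls map loosely onto the knobs of scikit-learn's gradient boosting implementation; this is only a rough analogue and does not ignore problematic cases or constrain interactions the way TN does:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=1000,    # many small trees
    max_leaf_nodes=6,     # keep each tree small (2-6 nodes common)
    learning_rate=0.01,   # small updates
    subsample=0.5,        # train each cycle on a random half of the data
    random_state=0,
).fit(X, y)
```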
• 84. As TN models consist of hundreds or even thousands of trees, there is no useful way to represent the model via a display of one or two trees  However, the model can be summarized in a variety of ways ◦ Partial Dependency Plots: exhibit the relationship between the target and any predictor, as captured by the model ◦ Variable Importance Rankings: these stable rankings give an excellent assessment of the relative importance of predictors ◦ ROC and Gains Curves: TN models produce scores that are typically unique for each scored record ◦ Confusion Matrix: using an adjustable score threshold, this matrix displays the model's false positive and false negative rates  TreeNet models based on 2-node trees by definition EXCLUDE interactions ◦ The model may be highly nonlinear but is by definition strictly additive ◦ Every term in the model is based on a single variable (single split)  Build TreeNet on larger trees (the default is 6 nodes) ◦ Permits up to 5-way interactions but in practice behaves more like 3-way  One can conduct an informal likelihood ratio test of TN(2-node) versus TN(6-node)  Large differences signal important interactions
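Continuing the boosted-model sketch above (gb and X are taken from it), the first two kinds of summaries can be produced with generic tooling; ROC/gains curves and the confusion matrix follow from the predicted scores in the usual way:

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

print(gb.feature_importances_)                                    # variable importance ranking
PartialDependenceDisplay.from_estimator(gb, X, features=[0, 1])   # effect of predictors 0 and 1
plt.show()
```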
  • 85. (insert graphs)  The results of running TN on the Boston Housing Database are shown  All of the key insights agree with previous findings by MARS and CART
  • 86. Slope reverses due to interaction  Note that the dominant pattern is downward sloping, but that a key segment defined by the 3rd variable is upward sloping  (insert graph)
  • 87.
• 88. CART: the model is one optimized tree ◦ The model is easy to interpret as rules  Can be useful for data exploration prior to attempting a more complex model ◦ The model can be applied quickly by a variety of workers:  A series of questions for phone bank operators to detect fraudulent purchases  Rapid triage in hospital emergency rooms ◦ In some cases may produce the best or the most predictive model, for example in classification with a barely detectable signal ◦ Missing values are handled easily and naturally; can be deployed effectively even when new data have a different missingness pattern  Random Forests: a combination of many LARGE trees ◦ Unique nonparametric distance metric that works in high-dimensional spaces ◦ Often predicts well when other models work poorly, e.g. data with high-level interactions ◦ On the most difficult data sets can be the best way to identify important variables  TreeNet: a combination of MANY small trees ◦ Best overall forecast performance in many cases ◦ Constrained models can be used to test the complexity of the data structure non-parametrically ◦ Exceptionally good with binary targets
• 89. Neural Networks: a combination of a few sigmoidal activation functions ◦ Very complex models can be represented in a very compact form ◦ Can accurately forecast both levels and slopes, and even higher-order derivatives ◦ Can efficiently use vector dependent variables  Cross-equation constraints can be imposed (see Symmetry constraints for feedforward network models of gradient systems, Cardell, Joerding, and Li, IEEE Transactions on Neural Networks, 1993) ◦ During the deployment phase, forecasts can be computed very quickly  High-voltage transmission lines use a neural network to detect whether there has been a lightning strike, fast enough to shut down the line before it can be damaged  Kernel function estimators: use a local mean or a local regression (a sketch follows below) ◦ Local estimates are easy to understand and interpret ◦ Local regression versions can estimate slopes and levels ◦ Initial estimation can be quick
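A minimal one-dimensional Nadaraya-Watson local-mean sketch of the kernel idea (Gaussian kernel; bandwidth and data are illustrative):

```python
import numpy as np

def kernel_regression(x_train, y_train, x_query, bandwidth=0.5):
    """Local mean: weight each training point by a Gaussian kernel of its distance."""
    d = x_query[:, None] - x_train[None, :]
    w = np.exp(-0.5 * (d / bandwidth) ** 2)
    return (w * y_train).sum(axis=1) / w.sum(axis=1)

x = np.linspace(0, 10, 200)
y = np.sin(x) + np.random.default_rng(0).normal(scale=0.2, size=x.size)
y_hat = kernel_regression(x, y, np.linspace(0, 10, 50))   # smoothed estimate at 50 query points
```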
• 90. Random Forests: ◦ Models are large, complex, and uninterpretable ◦ Limited to moderate sample sizes (usually less than 100,000 observations) ◦ Hard to tell in advance which cases Random Forests will work well on ◦ Deployed models require substantial computation  TreeNet: ◦ Models are large and complex; interpretation requires additional work ◦ Deployed models either require substantial computation or post-processing of the original model into a more compact form  CART: ◦ In most cases models are less accurate than TreeNet ◦ Works poorly in cases where effects are approximately linear in continuous variables or additive over many variables
• 91. Neural Networks: ◦ Neural Networks cover such a wide variety of models that no good widely-applicable modeling software exists, or may even be possible  The most dramatic successes have been with Neural Network models that are idiosyncratic to the specific case and were developed with great effort  Fully optimized Neural Network parameter estimates can be very difficult to compute, and sometimes perform substantially worse than initial, statistically inferior estimates (the “overtraining” issue) ◦ In almost all cases initial estimation is very compute intensive ◦ Limited to very small numbers of variables (typically between about 6 and 20, depending on the application)  Kernel Function Estimators: ◦ Deployed models can require substantial computation ◦ Limited to small numbers of variables ◦ Sensitive to distance measures: even a modest number of variables can degrade performance substantially, due to the influence of relatively unimportant variables on the distance metric
• 92. Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Pacific Grove: Wadsworth.  Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.  Hastie, T., Tibshirani, R., and Friedman, J.H. (2001). The Elements of Statistical Learning. Springer.  Freund, Y. and Schapire, R.E. (1996). Experiments with a new boosting algorithm. In L. Saitta, ed., Machine Learning: Proceedings of the Thirteenth International Conference, Morgan Kaufmann, pp. 148-156.  Friedman, J.H. (1999). Stochastic gradient boosting. Stanford: Statistics Department, Stanford University.  Friedman, J.H. (1999). Greedy function approximation: a gradient boosting machine. Stanford: Statistics Department, Stanford University.