This document discusses methods for interpreting complex predictive models, specifically focusing on determining variable importance. It introduces input shuffling as a technique to assess variable importance for any model, including linear regression, decision trees, neural networks, and ensembles. Input shuffling randomly shuffles values of one input variable at a time and measures the impact on model predictions to determine how influential each variable is to the model. The document demonstrates input shuffling on both regression and classification tasks.
1. When Model Interpretation Matters: Understanding Complex Predictive Models
Dean Abbott
Co-Founder and Chief Data Scientist, SmarterHQ
Twitter: @deanabb
11. Other Ways to Compute Neural Network Sensitivities
Such as… http://www.palisade.com/downloads/pdf/academic/DTSpaper110915.pdf
And ftp://ftp.sas.com/pub/neural/importance.html#mlp_parder_interp
• Weight tracing – sum of product of weights (and variants)
• Partial derivatives – avg, avg absolute, squared, etc.
• Remove variable, compute change in accuracy
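As one illustration of the partial-derivative approach in the list above, here is a minimal model-agnostic sketch using finite differences (the predict callable, function name, and epsilon are assumptions, not from the talk):

```python
import numpy as np

def avg_abs_partial_derivatives(predict, X, eps=1e-4):
    """Average absolute finite-difference partial derivative of the model
    output with respect to each input (one of the measures listed above)."""
    X = np.asarray(X, dtype=float)
    sens = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        hi, lo = X.copy(), X.copy()
        hi[:, j] += eps   # nudge input j up for every record
        lo[:, j] -= eps   # and down
        sens[j] = np.mean(np.abs((predict(hi) - predict(lo)) / (2 * eps)))
    return sens
```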
12. Naïve Bayes Model Outputs
Essentially a series of cross-tabs for every variable!
Remember, the final probability is the product of the individual variable probabilities.
14. What About Model Ensembles?
[Diagram: 10s to 100s of trees feed into decision logic that combines them into the ensemble prediction]
16. Outline
• Classical variable importance: linear regression
• Hack #1: using linear regression model statistics to infer variable importance
17. The Data: Easiest Possible!
• 3 inputs: each is a random Normal: mean = 20, std = 5
• Target variable: 0.5*var1 + 0.2*var2 + 0.3*var3
• 95,412 records (same size as cup98lrn)
18. Linear Regression Coefficient for Each Variable to Assess Influence
• Coefficients match (by definition) the proportions used to build the target variable
• This is the average influence of each input on the predictions for all records
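As a minimal sketch of this baseline (assuming numpy, pandas, and scikit-learn; variable names are illustrative), the snippet below generates the idealized data and shows that the fitted coefficient proportions recover the 0.5/0.2/0.3 weights:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 95_412  # same record count as cup98lrn

# Three independent Normal(20, 5) inputs; exact linear target
X = pd.DataFrame({f"var{i}": rng.normal(20, 5, n) for i in (1, 2, 3)})
y = 0.5 * X["var1"] + 0.2 * X["var2"] + 0.3 * X["var3"]

model = LinearRegression().fit(X, y)
coefs = np.abs(model.coef_)
# Coefficient proportions recover the generating weights: ~0.5, 0.2, 0.3
print(dict(zip(X.columns, coefs / coefs.sum())))
```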
19. Assess Influence with t-Proportion for Each Variable
• I know I'm breaking rules here. Bear with me…
20. Assess Influence with t-Proportion for Each Variable
• The t-value measures the significance of the relationship
• It turns out that the proportion of the t-values for the exact model matches the coefficients
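One way to reproduce this claim is with statsmodels (a sketch; a small noise term is added as an assumption so the t-statistics stay finite on the otherwise exact data):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
X = pd.DataFrame({f"var{i}": rng.normal(20, 5, 95_412) for i in (1, 2, 3)})
# Tiny noise keeps the standard errors (and t-values) finite
y = 0.5 * X["var1"] + 0.2 * X["var2"] + 0.3 * X["var3"] + rng.normal(0, 0.1, len(X))

fit = sm.OLS(y, sm.add_constant(X)).fit()
t = fit.tvalues.drop("const").abs()
# Because the inputs share the same scale, the |t| proportions track the coefficients
print(t / t.sum())
```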
21. Assess Influence Using a Direct Measure of Influence Proportion
• Compute the contribution of each term in the linear regression model separately (for each record)
– Var1_influence = $var1coef$ * $var1$, etc.
• Compute each contribution's proportion of the predicted target variable value
• Average these proportions over all records to compute the average influence of each variable
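A sketch of this per-record calculation on the same synthetic data (names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = pd.DataFrame({f"var{i}": rng.normal(20, 5, 95_412) for i in (1, 2, 3)})
y = 0.5 * X["var1"] + 0.2 * X["var2"] + 0.3 * X["var3"]
model = LinearRegression().fit(X, y)

# Per-record term contributions: varJ_influence = varJcoef * varJ
contrib = (X * model.coef_).abs()
# Each term's proportion of the record's total contribution, averaged over records
proportions = contrib.div(contrib.sum(axis=1), axis=0)
print(proportions.mean())  # average influence of each variable
```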
22. So Far So Good
• Now let's do the same thing for
– Neural Networks
– Support Vector Machines
24. Motivation for Input Shuffling
http://www.elderresearch.com/company/target-shuffling
25. Why “Input Shuffling”?
• We don't always have nice metrics to assess the inputs of predictive models -- Neural Networks, SVMs, ensembles
– Contrast with statistical methods like regression
• Even with regression, we don't always have the right input distributions for these metrics to be good indicators of variable influence
27. What Does “Shuffled” Mean?
• Scramble (randomly) a single input variable
– The Input Shuffling node doesn't have to be in a loop; it can scramble one column while leaving the others in their natural order
• Shuffling captures the actual distribution of the data
28. Principles of Input Shuffling
• Key: randomly re-populate values of a single input variable while leaving all other variables with their original values
• Compute the standard deviation (or some other measure of perturbation) for each record
– Of the predicted target variable (posterior probability)
– NOT the actual target variable value
• This perturbation is a measure of how influential the variable is in the model
– High standard deviation -> lots of influence
– Low standard deviation -> not much influence
– ~0 standard deviation -> no influence
29. Shuffled Inputs Meta Node
Two loops: (1) loop over the input variables and (2) shuffle the selected input variable (50x or so)
30. The Input Shuffling Process
1. Build the predictive model.
2. Select a data subset of N records (the training set, or some suitably sized set).
3. Loop over every input variable:
   1. Loop M times (50 by default):
      1. Shuffle the variable (keeping all other inputs for that row fixed).
      2. Score the model.
      3. Save the scores for the entire data set (you will end up with M times the N records).
   2. Compute the standard deviation of the predictions for each row (or some other measure of "spread"), i.e., group by Row ID, computing the stdev. Now we have N records again.
   3. Compute the average spread of the input over all N records, such as the mean of these standard deviations, i.e., group over the entire data set. Now we have one number: the variable influence.
4. Compare all results; sort descending by variable influence (see the sketch below).
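A minimal sketch of this loop in Python (assuming a scikit-learn-style model with a predict method; the function and argument names are illustrative):

```python
import numpy as np
import pandas as pd

def input_shuffling_influence(model, X, n_shuffles=50, seed=0):
    """Influence of each input: the average per-row spread of predictions
    when that input alone is repeatedly shuffled."""
    rng = np.random.default_rng(seed)
    influence = {}
    for col in X.columns:                      # outer loop: every variable
        scores = np.empty((n_shuffles, len(X)))
        for m in range(n_shuffles):            # inner loop: M shuffles
            X_shuf = X.copy()
            # Permute this column only; all other inputs keep their values
            X_shuf[col] = rng.permutation(X[col].to_numpy())
            scores[m] = model.predict(X_shuf)
        # Std dev per row across the M shuffles, then averaged over all rows
        influence[col] = scores.std(axis=0).mean()
    return pd.Series(influence).sort_values(ascending=False)
```

Dividing the result by its sum (influence / influence.sum()) gives the influence proportions compared in the slides that follow.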
32. Average for All Records in the Data
• Measures the spread of the predictions when randomly perturbing a single input variable
33. Input Shuffling Result: Idealized Linear Regression Data
• Compute the proportion of the average standard deviation from shuffling each input (keeping the others at their original values)
• (Yes, I know I'm averaging standard deviations!)
Target variable: 0.5*var1 + 0.2*var2 + 0.3*var3
34. Realistic Data: KDD Cup 1998
• 95,412 records: cup98lrn from the KDD Cup 1998 competition
– Use only the responders (4,843) in the linear regression models
• Hundreds of fields in the data, but only four are used here:
– LASTGIFT, NGIFTALL, RFA_2F, D_RFA_2A
• Continuous target
• Two continuous inputs
• One ordinal input (RFA_2F)
• One dummy input (D_RFA_2A)
35. Realistic Data: KDD Cup 1998
• Heavy skew of LASTGIFT, NGIFTALL, TARGET_D
– Makes visualization difficult
– Biases regression coefficients (if one cares)
– So, apply the usual “best practices”
36. Normalized Data
• To remove influence of skew and scale
– Log10 transform LASTGIFT, NGIFTALL, TARGET_D
– Scale all variables (post log10) to [0, 1]
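A sketch of these two steps (the +1 shift before the log is an assumption here, to handle zero values; the function name is illustrative):

```python
import numpy as np
import pandas as pd

def normalize(df: pd.DataFrame, log_cols) -> pd.DataFrame:
    out = df.copy()
    for c in log_cols:
        out[c] = np.log10(out[c] + 1)  # +1 shift assumed, to handle zeros
    # Min-max scale every column (post log10) to [0, 1]
    return (out - out.min()) / (out.max() - out.min())

# e.g.: normalize(cup98, log_cols=["LASTGIFT", "NGIFTALL", "TARGET_D"])
```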
37. Normalized Data
• Relationships are clearer
– LASTGIFT has a strong positive correlation with TARGET_D
– NGIFTALL, RFA_2F, and D_RFA_2A all have apparently slight negative correlations with TARGET_D
38. The Basic Model: Linear Regression Coefficient
Use abs() for influence calculations
39. Linear Regression: Compare Influence Using Different Methods
[Charts: Coefficient proportion vs. t-Proportion; use abs() for both the coefficient and t-proportion calculations]
40. Linear Regression: Compare Influence Using Different Methods
[Charts: Coefficient, t-Proportion, Direct Proportion, and Input Shuffling Proportion; use abs() for the coefficient and t-proportion calculations]
41. Linear Regression, Neural Network: Input Shuffling Influence
[Charts: Input Shuffling - LR vs. Input Shuffling - MLP]
42. Applying Input Shuffling to Classification: Logistic Regression
Start simple: just 4 variables (like the regression example)
43. Applying Input Shuffling to Classification: Logistic Regression
[Charts: Influence based on proportion of z-scores vs. influence based on input shuffling]
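For the z-score proportion, a sketch with statsmodels (on synthetic stand-in data generated here for illustration, not the actual KDD Cup fields):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(5_000, 4)),
                 columns=["LASTGIFT", "NGIFTALL", "RFA_2F", "D_RFA_2A"])
p = 1 / (1 + np.exp(-(0.8 * X["LASTGIFT"] - 0.3 * X["NGIFTALL"])))
y = rng.binomial(1, p)  # binary response

fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
z = fit.tvalues.drop("const").abs()  # for Logit, .tvalues holds the z-scores
print(z / z.sum())                   # influence as a proportion of |z|
```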
46. Conclusion
• Input shuffling can generate model sensitivity scores for any algorithm, no matter how complex or nonlinear the model is
• Matches linear regression variable influence (t-value proportion)
• Similar to logistic regression variable influence (z-score proportion)
47. Future Work
• If model predictions (scores) are not normally distributed, and if the influence is not uniform, the average overall influence doesn't tell the full story (or may even tell a misleading story) about how valuable the variable is in predicting the target
– Breaking the predictions into bins (deciles or some other number of bins) lets us compute an influence score for every part of the predicted range
– Answers the question: for high predicted values, which variables are most influential?
• Build score influence rather than prediction influence, as sketched below
– Use ROC AUC statistics for each shuffled input, and determine the influence of each variable on the model score rather than the predicted value
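One possible realization of the AUC idea (a sketch, assuming a binary classifier with a predict_proba method; the function name and defaults are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_influence(model, X, y, n_shuffles=50, seed=0):
    """Score influence: the average drop in ROC AUC when one input is shuffled."""
    rng = np.random.default_rng(seed)
    base = roc_auc_score(y, model.predict_proba(X)[:, 1])
    drops = {}
    for col in X.columns:
        aucs = []
        for _ in range(n_shuffles):
            X_shuf = X.copy()
            # Permute this column only; all other inputs keep their values
            X_shuf[col] = rng.permutation(X[col].to_numpy())
            aucs.append(roc_auc_score(y, model.predict_proba(X_shuf)[:, 1]))
        drops[col] = base - float(np.mean(aucs))
    return drops
```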