1. MEVSYS DATA MINING
KNOWLEDGE DISCOVERY
Exploring the variables the models use and how they drive predictions.
2013
TO PROTECT CUSTOMER CONFIDENTIALITY, SOME REFERENCES HAVE BEEN OMITTED AND/OR GENERALIZED.
PERCENTAGES AND RESULTS ARE KEPT UNTOUCHED.
2. WHAT IS KNOWLEDGE DISCOVERY?
A direct marketing model was used to sort clients by their probability of
responding to new campaigns. The score alone, which is the model's output,
was enough to put the model to use.
Nevertheless, the company's product department wanted to gain
some insight into which variables were selected and how the
predictive model was using them.
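The scoring step described above can be sketched as follows. This is a hypothetical minimal example, assuming scikit-learn and synthetic data; the actual model, variables, and customer data are confidential and not shown in the deck.

```python
# Hypothetical sketch: ranking customers by a model's response score.
# The data and the classifier are illustrative stand-ins, not the originals.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                          # toy predictive variables
y = (X[:, 0] + rng.normal(size=100) > 0).astype(int)   # toy response flag

model = RandomForestClassifier(random_state=0).fit(X, y)
scores = model.predict_proba(X)[:, 1]   # response probability = the "score"
ranking = np.argsort(scores)[::-1]      # customers sorted best-first
print(ranking[:5])                      # top 5 customers to contact
```

Sorting by the score is all a campaign needs operationally, which is exactly why the product department's question about *which* variables drive the score requires the extra analysis covered in the following slides.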
[Diagram: predictive variables feed the model, which outputs a score]
3. DECISION TREE INTERPRETATION
How easy a model is to interpret depends on the algorithm used. Decision trees are
among the most intuitive models to interpret, so we built one specifically to find
relations between the predictive variables and the customer response rate.
As can be seen in this example, a decision tree splits the data one variable at a
time into different branches, which it then proceeds to split again with other variables.
This process is repeated until no further branching improves performance.
A classic and simple decision
tree example in which the
model decides whether or not
it is possible to play golf.
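A tree like the play-golf example can be grown and printed in a few lines. The sketch below assumes scikit-learn and uses the well-known textbook weather toy data set, not the customer data from this project.

```python
# A minimal decision-tree sketch in the spirit of the play-golf example.
# The weather rows below are the classic textbook toy set (assumption:
# scikit-learn is available); they are not the project's customer data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# outlook encoded 0=sunny 1=overcast 2=rain; humidity in %; windy 0/1
X = np.array([[0, 85, 0], [0, 90, 1], [1, 78, 0], [2, 96, 0],
              [2, 80, 0], [2, 70, 1], [1, 65, 1], [0, 95, 0],
              [0, 70, 0], [2, 80, 0], [0, 70, 1], [1, 90, 1],
              [1, 75, 0], [2, 80, 1]])
y = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0])  # 1 = play

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# export_text prints the splits as nested if/else rules, one per branch
print(export_text(tree, feature_names=["outlook", "humidity", "windy"]))
```

The printed branches are exactly the kind of human-readable rules that the next slide extracts from the real model.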
4. DECISION TREE’S BEST RULES
As a decision tree splits, it builds a set of rules based on the variables. The
following is the best ruleset found, meaning the customers it selects form
the group with the highest response rate.
1. FIELD1 > 8
2. FIELD2 <= 37
3. FIELD3 > 0 * this was a 0/1 dummy variable, so this rule could also be expressed as FIELD3 = 1.
4. FIELD4 = 0 * this was another dummy variable.
5. FIELD5 <= 24740
6. FIELD6 > 272
Knowing what these fields were, the company gained valuable insight
into which types of customers were answering its marketing calls.
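In practice the six rules above are just a boolean filter over the customer table. A hypothetical pandas sketch (FIELD1..FIELD6 stand in for the confidential variables, and the three example rows are made up):

```python
# Hypothetical sketch: selecting the customers that satisfy the best ruleset.
# FIELD1..FIELD6 and the rows are illustrative, not the confidential data.
import pandas as pd

customers = pd.DataFrame({
    "FIELD1": [10, 5, 12],
    "FIELD2": [30, 40, 20],
    "FIELD3": [1, 1, 0],
    "FIELD4": [0, 0, 0],
    "FIELD5": [20000, 10000, 30000],
    "FIELD6": [300, 400, 100],
})

# The six rules from the slide, applied as one combined boolean mask
best = customers[
    (customers.FIELD1 > 8) & (customers.FIELD2 <= 37)
    & (customers.FIELD3 == 1) & (customers.FIELD4 == 0)
    & (customers.FIELD5 <= 24740) & (customers.FIELD6 > 272)
]
print(best.index.tolist())  # → [0]: only the first row passes every rule
```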
5. DECISION TREE’S FIRST SPLIT GRAPH
[Graph: response rate above (+) and below (-) the first split, split point = 8]
6. COMPLEX MODELS
Neural networks, support vector machines, and other algorithms are more complex and
do not yield such easy-to-read rules; they are known as black boxes into which not much
can be seen.
Nevertheless, there are some ways of analyzing variable importance. One of them
consists of leaving out one variable at a time, recording the model's drop in
performance, and thereby calculating a relative predictive value for each variable.
FIELD1  0.1291
FIELD2  0.1036
FIELD3  0.0961
FIELD4  0.0884
FIELD5  0.0860
FIELD6  0.0830
FIELD7  0.0794
FIELD8  0.0644
FIELD9  0.0320
FIELD10 0.0262
7. VARIABLE CORRELATION GRAPHS
Once the most predictive variables were determined, we plotted each of them
against the response rate to see correlations visually and understand how each
variable individually affected the outcome.
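The numeric counterpart of such a graph is the response rate binned by one variable, plus a correlation coefficient. A hypothetical pandas sketch on synthetic data (FIELD1 and the response behavior are invented for illustration):

```python
# Hypothetical sketch: response rate binned by one predictive variable,
# the numeric counterpart of the correlation graphs described above.
# FIELD1 and its relationship to the response are synthetic assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
field1 = rng.integers(0, 20, size=1000)
# make the response probability rise with FIELD1 (positive correlation)
response = (rng.random(1000) < field1 / 40).astype(int)

df = pd.DataFrame({"FIELD1": field1, "response": response})
bins = pd.cut(df.FIELD1, bins=4)
by_bin = df.groupby(bins, observed=True)["response"].mean()
print(by_bin)                        # response rate per FIELD1 bin
print(df.FIELD1.corr(df.response))   # correlation sign and strength
```

Plotting `by_bin` as a bar chart reproduces the kind of graph the slide shows; a downward-sloping chart is what the deck's negative-correlation example looks like.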
[Graph: response rate vs. one of the top variables, showing a negative correlation]
8. SUMMARY
Specifically for knowledge discovery we built a decision tree which,
even though it did not perform as well as the model used for
deployment, presented an opportunity to discover intuitive rules.
We also explored variable importance and each variable's correlation
to the outcome. Our customer was often surprised, discovering
relationships they had never thought about before, or ones that even
contradicted their assumptions.
Knowledge discovery in data mining makes it possible to examine the
relationships between the available variables and the outcome of
interest without manually searching through each and every single
one of them, allowing focus to be concentrated on the most
relevant fields.