2. Know thy tools
Stop treating data miners as black boxes.
Looking inside is (1) fun, (2) easy, (3) needed.
3. INFOGAIN: (the Fayyad and Irani MDL discretizer) in 55 lines
https://raw.githubusercontent.com/timm/axe/master/old/ediv.py
Input: [ (1,X), (2,X), (3,X), (4,X), (11,Y), (12,Y), (13,Y), (14,Y) ]
Output: 1, 11
E = Σ –p*log2(p)
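The entropy formula above can be tried on the slide's input with a minimal sketch. This is not the linked ediv.py: it finds only the single best binary cut, and it omits the recursion and the MDL stopping rule of the full Fayyad and Irani method. The names entropy and best_split are hypothetical, chosen for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """E = sum(-p * log2(p)) over the class proportions in labels."""
    n = len(labels)
    return sum(-c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(pairs):
    """Return (cut, expected_entropy) for the single best binary cut.

    The cut is reported as the first numeric value of the right-hand bin.
    Returns (None, unsplit_entropy) if no cut lowers the entropy.
    """
    pairs = sorted(pairs)
    labels = [label for _, label in pairs]
    n = len(pairs)
    best = (None, entropy(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between identical values
        left, right = labels[:i], labels[i:]
        # Expected entropy: each side weighted by its share of the rows.
        e = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if e < best[1]:
            best = (pairs[i][0], e)
    return best

data = [(1, "X"), (2, "X"), (3, "X"), (4, "X"),
        (11, "Y"), (12, "Y"), (13, "Y"), (14, "Y")]
cut, e = best_split(data)
print(cut, e)
```

On this input the unsplit entropy is 1 bit (half X, half Y), and cutting at 11 leaves two pure bins. That matches the "11" in the slide's output, if the "1" there is read as the lower bound of the first bin.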
6. It doesn't matter what you do but
does matter who does it!
Martin Shepperd, Brunel University, West London, UK
http://crest.cs.ucl.ac.uk/?id=3695
7. Systematic Review
• Conducted by Tracy Hall and David Bowes
– T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell. “A systematic
literature review on fault prediction performance in software
engineering”, Accepted for publication in TSE (download from BURA).
• Located 208 relevant primary studies
• Because of reporting requirements, used 18
studies, which contain 194 results
– binary classifiers, confusion matrix, context details
8. Matthews correlation coefficient
[Figure: histogram titled "MCC" (x-axis: Dataset$MCC, y-axis: frequency); second panel with axes rnorm(194) and Dataset$MCC]
TABLE IV
COMPOSITE PERFORMANCE MEASURES
[…] | Defined as | Description
[…] detection) | TP / (TP + FN) | Proportion of faulty units cor[…]
[…] | TP / (TP + FP) | Proportion of units correctl[…] faulty
[…]alse alarm) | FP / (FP + TN) | Proportion of non-faulty un[…] classified
[…] | TN / (TN + FP) | Proportion of correctly classi[…] units
[…] | 2 · Recall · Precision / (Recall + Precision) | Most commonly defined as […] mean of precision and recall
[…] | (TN + TP) / (TN + FN + FP + TP) | Proportion of correctly classifi[…]
[…]on Coefficient | (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)) | Combines all quadrants of th[…]sion matrix to produce a value […] to +1 with 0 indicating random […]tween the prediction and the r[…] MCC can be tested for statistic[…] with χ² = N · MCC² where […] number of instances.
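The formulas in Table IV can be checked with a short sketch that computes each measure from one confusion matrix. The function name composite_measures, the dictionary keys, and the counts in the example are made up for illustration and do not come from the reviewed studies. The keys use the standard names for these formulas (recall, precision, false alarm rate, specificity, F-measure, accuracy, MCC). The sketch assumes every ratio has a nonzero denominator, and returning 0 for MCC when its denominator is zero is a convention adopted here, not something the table states.

```python
import math

def composite_measures(tp, fn, fp, tn):
    """Compute the Table IV measures from confusion-matrix counts."""
    n = tp + fn + fp + tn
    recall = tp / (tp + fn)          # faulty units correctly classified
    precision = tp / (tp + fp)       # predicted-faulty units that are faulty
    pf = fp / (fp + tn)              # non-faulty units incorrectly classified
    specificity = tn / (tn + fp)     # non-faulty units correctly classified
    f_measure = 2 * recall * precision / (recall + precision)
    accuracy = (tn + tp) / n
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report MCC as 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    chi2 = n * mcc ** 2              # chi-squared statistic: N * MCC^2
    return {
        "recall": recall, "precision": precision, "pf": pf,
        "specificity": specificity, "f_measure": f_measure,
        "accuracy": accuracy, "mcc": mcc, "chi2": chi2,
    }

# Hypothetical counts: 50 faulty units, 150 non-faulty units.
m = composite_measures(tp=40, fn=10, fp=20, tn=130)
for name, value in m.items():
    print(f"{name:12s} {value:.3f}")
```

With these counts accuracy is 0.85 while MCC is about 0.63, which shows how a measure that uses all four quadrants can read lower than accuracy when the classes are imbalanced.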