SlideShare ist ein Scribd-Unternehmen logo
1 von 12
1
Data Mining
tim.menzies@gmail.com
Know thy tools
Stop treating data miners as black boxes.
Looking inside is (1) fun, (2) easy, (3) needed.
2
INFOGAIN: (the Fayyad and Irani MDL discretizer) in 55 lines
https://raw.githubusercontent.com/timm/axe/master/old/ediv.py
Input: [ (1,X), (2,X), (3,X), (4,X), (11,Y), (12,Y), (13,Y), (14,Y) ]
Output: 1, 11 dsfdsdssdsdsddsdsdsfsdfsdsdfsdsdf
3
E = Σ –p*log2(p)
Know thy tools
Stop treating data miners as black boxes.
Looking inside is (1) fun, (2) easy, (3) needed.
4
Know thy tools
Stop treating data miners as black boxes.
Looking inside is (1) fun, (2) easy, (3) needed.
5
It doesn't matter what you do but
does matter who does it!
Martin Shepperd, Brunel University, West London, UK
http://crest.cs.ucl.ac.uk/?id=3695
6
Systematic Review
• Conducted by Tracy Hall and David Bowes
– T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell. “A systematic
literature review on fault prediction performance in software
engineering”, Accepted for publication in TSE (download from BURA).
• Located 208 relevant primary studies
• Due to reporting requirements used 18
studies that contain 194 results
– binary classifiers, confusion matrix, context details
7
Matthews correlation coefficient
8
MCC
Dataset$MCC
frequency
-0.2 0.0 0.2 0.4 0.6 0.8
0102030405060
-2 -1 0 1 2
-0.20.00.20.40.60.8
rnorm(194)
Dataset$MCC
TABLE IV
COMPOSITE PERFORMANCE MEASURES
Defined as Description
detection)
TP/ (TP + F N ) Proportion of faulty units cor
TP/ (TP + F P)
Proportion of units correctl
faulty
alse alarm)
F P/ (F P + TN )
Proportion of non-faulty un
classified
TN/ (TN + F P)
Proportion of correctly classi
units
2·R ecal l ·P r eci si on
R ecal l + P r eci si on
Most commonly defined as
mean of precision and recall
( T N + T P )
(T N + F N + F P + T P )
Proportion of correctly classifi
on Coefficient
T P ⇥T N − F P ⇥F Np
(T P + F P )( T P + F N )(T N + F P )(T N + F N )
Combines all quadrants of th
sion matrix to produce avalue
to +1 with 0 indicating random
tween the prediction and the r
MCC can betested for statistic
with χ2 = N · M CC2 where
number of instances.
(iv) Research Group
9
ANOVA Results
Factor % of var
Author group 61%
Metric family 3%
Author/metric 9%
Everything else 8% (but not significant)
Residuals 19%
10
Final word
We cannot ignore the fact that
the main determinant of a
validation study result is which
research group undertakes it.
11
Know thy tools
Stop treating data miners as black boxes.
Looking inside is (1) fun, (2) easy, (3) needed.
12

Weitere ähnliche Inhalte

Ähnlich wie Know thy tools

7076 chapter5 slides
7076 chapter5 slides7076 chapter5 slides
7076 chapter5 slidesNguyen Mina
 
Randić Index of Some Class of Trees with an Algorithm
Randić Index of Some Class of Trees with an AlgorithmRandić Index of Some Class of Trees with an Algorithm
Randić Index of Some Class of Trees with an AlgorithmEditor IJCATR
 
Métodos computacionales para el estudio de modelos epidemiológicos con incer...
Métodos computacionales para el estudio de modelos  epidemiológicos con incer...Métodos computacionales para el estudio de modelos  epidemiológicos con incer...
Métodos computacionales para el estudio de modelos epidemiológicos con incer...Facultad de Informática UCM
 
Trunsored data analysis
Trunsored data analysisTrunsored data analysis
Trunsored data analysisHideo Hirose
 
Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10) Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10) Larry Guo
 
MCQMC 2020 talk: Importance Sampling for a Robust and Efficient Multilevel Mo...
MCQMC 2020 talk: Importance Sampling for a Robust and Efficient Multilevel Mo...MCQMC 2020 talk: Importance Sampling for a Robust and Efficient Multilevel Mo...
MCQMC 2020 talk: Importance Sampling for a Robust and Efficient Multilevel Mo...Chiheb Ben Hammouda
 
Skiena algorithm 2007 lecture09 linear sorting
Skiena algorithm 2007 lecture09 linear sortingSkiena algorithm 2007 lecture09 linear sorting
Skiena algorithm 2007 lecture09 linear sortingzukun
 
DPPs everywhere: repulsive point processes for Monte Carlo integration, signa...
DPPs everywhere: repulsive point processes for Monte Carlo integration, signa...DPPs everywhere: repulsive point processes for Monte Carlo integration, signa...
DPPs everywhere: repulsive point processes for Monte Carlo integration, signa...Advanced-Concepts-Team
 
A common random fixed point theorem for rational ineqality in hilbert space ...
 A common random fixed point theorem for rational ineqality in hilbert space ... A common random fixed point theorem for rational ineqality in hilbert space ...
A common random fixed point theorem for rational ineqality in hilbert space ...Alexander Decker
 
Selm Falzon Compressed
Selm Falzon CompressedSelm Falzon Compressed
Selm Falzon Compressedgfalzon2
 
Reading Seminar (140515) Spectral Learning of L-PCFGs
Reading Seminar (140515) Spectral Learning of L-PCFGsReading Seminar (140515) Spectral Learning of L-PCFGs
Reading Seminar (140515) Spectral Learning of L-PCFGsKeisuke OTAKI
 
MLP輪読スパース8章 トレースノルム正則化
MLP輪読スパース8章 トレースノルム正則化MLP輪読スパース8章 トレースノルム正則化
MLP輪読スパース8章 トレースノルム正則化Akira Tanimoto
 
Sampling and Reconstruction (Online Learning).pptx
Sampling and Reconstruction (Online Learning).pptxSampling and Reconstruction (Online Learning).pptx
Sampling and Reconstruction (Online Learning).pptxHamzaJaved306957
 
Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R...
Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R...Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R...
Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R...Chiheb Ben Hammouda
 
Computational Information Geometry on Matrix Manifolds (ICTP 2013)
Computational Information Geometry on Matrix Manifolds (ICTP 2013)Computational Information Geometry on Matrix Manifolds (ICTP 2013)
Computational Information Geometry on Matrix Manifolds (ICTP 2013)Frank Nielsen
 

Ähnlich wie Know thy tools (20)

7076 chapter5 slides
7076 chapter5 slides7076 chapter5 slides
7076 chapter5 slides
 
Randić Index of Some Class of Trees with an Algorithm
Randić Index of Some Class of Trees with an AlgorithmRandić Index of Some Class of Trees with an Algorithm
Randić Index of Some Class of Trees with an Algorithm
 
Randomized algorithms ver 1.0
Randomized algorithms ver 1.0Randomized algorithms ver 1.0
Randomized algorithms ver 1.0
 
Métodos computacionales para el estudio de modelos epidemiológicos con incer...
Métodos computacionales para el estudio de modelos  epidemiológicos con incer...Métodos computacionales para el estudio de modelos  epidemiológicos con incer...
Métodos computacionales para el estudio de modelos epidemiológicos con incer...
 
Trunsored data analysis
Trunsored data analysisTrunsored data analysis
Trunsored data analysis
 
Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10) Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10)
 
MCQMC 2020 talk: Importance Sampling for a Robust and Efficient Multilevel Mo...
MCQMC 2020 talk: Importance Sampling for a Robust and Efficient Multilevel Mo...MCQMC 2020 talk: Importance Sampling for a Robust and Efficient Multilevel Mo...
MCQMC 2020 talk: Importance Sampling for a Robust and Efficient Multilevel Mo...
 
Skiena algorithm 2007 lecture09 linear sorting
Skiena algorithm 2007 lecture09 linear sortingSkiena algorithm 2007 lecture09 linear sorting
Skiena algorithm 2007 lecture09 linear sorting
 
DPPs everywhere: repulsive point processes for Monte Carlo integration, signa...
DPPs everywhere: repulsive point processes for Monte Carlo integration, signa...DPPs everywhere: repulsive point processes for Monte Carlo integration, signa...
DPPs everywhere: repulsive point processes for Monte Carlo integration, signa...
 
A common random fixed point theorem for rational ineqality in hilbert space ...
 A common random fixed point theorem for rational ineqality in hilbert space ... A common random fixed point theorem for rational ineqality in hilbert space ...
A common random fixed point theorem for rational ineqality in hilbert space ...
 
Selm Falzon Compressed
Selm Falzon CompressedSelm Falzon Compressed
Selm Falzon Compressed
 
Reading Seminar (140515) Spectral Learning of L-PCFGs
Reading Seminar (140515) Spectral Learning of L-PCFGsReading Seminar (140515) Spectral Learning of L-PCFGs
Reading Seminar (140515) Spectral Learning of L-PCFGs
 
Chap04
Chap04Chap04
Chap04
 
MLP輪読スパース8章 トレースノルム正則化
MLP輪読スパース8章 トレースノルム正則化MLP輪読スパース8章 トレースノルム正則化
MLP輪読スパース8章 トレースノルム正則化
 
Sampling and Reconstruction (Online Learning).pptx
Sampling and Reconstruction (Online Learning).pptxSampling and Reconstruction (Online Learning).pptx
Sampling and Reconstruction (Online Learning).pptx
 
Lec9
Lec9Lec9
Lec9
 
Adaline and Madaline.ppt
Adaline and Madaline.pptAdaline and Madaline.ppt
Adaline and Madaline.ppt
 
Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R...
Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R...Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R...
Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R...
 
MUMS: Bayesian, Fiducial, and Frequentist Conference - Objective Bayesian Ana...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Objective Bayesian Ana...MUMS: Bayesian, Fiducial, and Frequentist Conference - Objective Bayesian Ana...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Objective Bayesian Ana...
 
Computational Information Geometry on Matrix Manifolds (ICTP 2013)
Computational Information Geometry on Matrix Manifolds (ICTP 2013)Computational Information Geometry on Matrix Manifolds (ICTP 2013)
Computational Information Geometry on Matrix Manifolds (ICTP 2013)
 

Mehr von CS, NcState

Talks2015 novdec
Talks2015 novdecTalks2015 novdec
Talks2015 novdecCS, NcState
 
GALE: Geometric active learning for Search-Based Software Engineering
GALE: Geometric active learning for Search-Based Software EngineeringGALE: Geometric active learning for Search-Based Software Engineering
GALE: Geometric active learning for Search-Based Software EngineeringCS, NcState
 
Big Data: the weakest link
Big Data: the weakest linkBig Data: the weakest link
Big Data: the weakest linkCS, NcState
 
Three Laws of Trusted Data Sharing: (Building a Better Business Case for Dat...
Three Laws of Trusted Data Sharing:(Building a Better Business Case for Dat...Three Laws of Trusted Data Sharing:(Building a Better Business Case for Dat...
Three Laws of Trusted Data Sharing: (Building a Better Business Case for Dat...CS, NcState
 
Lexisnexis june9
Lexisnexis june9Lexisnexis june9
Lexisnexis june9CS, NcState
 
Welcome to ICSE NIER’15 (new ideas and emerging results).
Welcome to ICSE NIER’15 (new ideas and emerging results).Welcome to ICSE NIER’15 (new ideas and emerging results).
Welcome to ICSE NIER’15 (new ideas and emerging results).CS, NcState
 
Icse15 Tech-briefing Data Science
Icse15 Tech-briefing Data ScienceIcse15 Tech-briefing Data Science
Icse15 Tech-briefing Data ScienceCS, NcState
 
Kits to Find the Bits that Fits
Kits to Find  the Bits that Fits Kits to Find  the Bits that Fits
Kits to Find the Bits that Fits CS, NcState
 
Ai4se lab template
Ai4se lab templateAi4se lab template
Ai4se lab templateCS, NcState
 
Automated Software Enging, Fall 2015, NCSU
Automated Software Enging, Fall 2015, NCSUAutomated Software Enging, Fall 2015, NCSU
Automated Software Enging, Fall 2015, NCSUCS, NcState
 
Requirements Engineering
Requirements EngineeringRequirements Engineering
Requirements EngineeringCS, NcState
 
172529main ken and_tim_software_assurance_research_at_west_virginia
172529main ken and_tim_software_assurance_research_at_west_virginia172529main ken and_tim_software_assurance_research_at_west_virginia
172529main ken and_tim_software_assurance_research_at_west_virginiaCS, NcState
 
Automated Software Engineering
Automated Software EngineeringAutomated Software Engineering
Automated Software EngineeringCS, NcState
 
Next Generation “Treatment Learning” (finding the diamonds in the dust)
Next Generation “Treatment Learning” (finding the diamonds in the dust)Next Generation “Treatment Learning” (finding the diamonds in the dust)
Next Generation “Treatment Learning” (finding the diamonds in the dust)CS, NcState
 
Tim Menzies, directions in Data Science
Tim Menzies, directions in Data ScienceTim Menzies, directions in Data Science
Tim Menzies, directions in Data ScienceCS, NcState
 
Dagstuhl14 intro-v1
Dagstuhl14 intro-v1Dagstuhl14 intro-v1
Dagstuhl14 intro-v1CS, NcState
 
The Art and Science of Analyzing Software Data
The Art and Science of Analyzing Software DataThe Art and Science of Analyzing Software Data
The Art and Science of Analyzing Software DataCS, NcState
 
What Metrics Matter?
What Metrics Matter? What Metrics Matter?
What Metrics Matter? CS, NcState
 

Mehr von CS, NcState (20)

Talks2015 novdec
Talks2015 novdecTalks2015 novdec
Talks2015 novdec
 
Future se oct15
Future se oct15Future se oct15
Future se oct15
 
GALE: Geometric active learning for Search-Based Software Engineering
GALE: Geometric active learning for Search-Based Software EngineeringGALE: Geometric active learning for Search-Based Software Engineering
GALE: Geometric active learning for Search-Based Software Engineering
 
Big Data: the weakest link
Big Data: the weakest linkBig Data: the weakest link
Big Data: the weakest link
 
Three Laws of Trusted Data Sharing: (Building a Better Business Case for Dat...
Three Laws of Trusted Data Sharing:(Building a Better Business Case for Dat...Three Laws of Trusted Data Sharing:(Building a Better Business Case for Dat...
Three Laws of Trusted Data Sharing: (Building a Better Business Case for Dat...
 
Lexisnexis june9
Lexisnexis june9Lexisnexis june9
Lexisnexis june9
 
Welcome to ICSE NIER’15 (new ideas and emerging results).
Welcome to ICSE NIER’15 (new ideas and emerging results).Welcome to ICSE NIER’15 (new ideas and emerging results).
Welcome to ICSE NIER’15 (new ideas and emerging results).
 
Icse15 Tech-briefing Data Science
Icse15 Tech-briefing Data ScienceIcse15 Tech-briefing Data Science
Icse15 Tech-briefing Data Science
 
Kits to Find the Bits that Fits
Kits to Find  the Bits that Fits Kits to Find  the Bits that Fits
Kits to Find the Bits that Fits
 
Ai4se lab template
Ai4se lab templateAi4se lab template
Ai4se lab template
 
Automated Software Enging, Fall 2015, NCSU
Automated Software Enging, Fall 2015, NCSUAutomated Software Enging, Fall 2015, NCSU
Automated Software Enging, Fall 2015, NCSU
 
Requirements Engineering
Requirements EngineeringRequirements Engineering
Requirements Engineering
 
172529main ken and_tim_software_assurance_research_at_west_virginia
172529main ken and_tim_software_assurance_research_at_west_virginia172529main ken and_tim_software_assurance_research_at_west_virginia
172529main ken and_tim_software_assurance_research_at_west_virginia
 
Automated Software Engineering
Automated Software EngineeringAutomated Software Engineering
Automated Software Engineering
 
Next Generation “Treatment Learning” (finding the diamonds in the dust)
Next Generation “Treatment Learning” (finding the diamonds in the dust)Next Generation “Treatment Learning” (finding the diamonds in the dust)
Next Generation “Treatment Learning” (finding the diamonds in the dust)
 
Tim Menzies, directions in Data Science
Tim Menzies, directions in Data ScienceTim Menzies, directions in Data Science
Tim Menzies, directions in Data Science
 
Goldrush
GoldrushGoldrush
Goldrush
 
Dagstuhl14 intro-v1
Dagstuhl14 intro-v1Dagstuhl14 intro-v1
Dagstuhl14 intro-v1
 
The Art and Science of Analyzing Software Data
The Art and Science of Analyzing Software DataThe Art and Science of Analyzing Software Data
The Art and Science of Analyzing Software Data
 
What Metrics Matter?
What Metrics Matter? What Metrics Matter?
What Metrics Matter?
 

Kürzlich hochgeladen

The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfJiananWang21
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VDineshKumar4165
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college projectTonystark477637
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)simmis5
 
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLPVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLManishPatel169454
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdfKamal Acharya
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesPrabhanshu Chaturvedi
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxfenichawla
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 

Kürzlich hochgeladen (20)

The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLPVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and Properties
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 

Know thy tools

  • 2. Know thy tools Stop treating data miners as black boxes. Looking inside is (1) fun, (2) easy, (3) needed. 2
  • 3. INFOGAIN: (the Fayyad and Irani MDL discretizer) in 55 lines https://raw.githubusercontent.com/timm/axe/master/old/ediv.py Input: [ (1,X), (2,X), (3,X), (4,X), (11,Y), (12,Y), (13,Y), (14,Y) ] Output: 1, 11 dsfdsdssdsdsddsdsdsfsdfsdsdfsdsdf 3 E = Σ –p*log2(p)
  • 4. Know thy tools Stop treating data miners as black boxes. Looking inside is (1) fun, (2) easy, (3) needed. 4
  • 5. Know thy tools Stop treating data miners as black boxes. Looking inside is (1) fun, (2) easy, (3) needed. 5
  • 6. It doesn't matter what you do but does matter who does it! Martin Shepperd, Brunel University, West London, UK http://crest.cs.ucl.ac.uk/?id=3695 6
  • 7. Systematic Review • Conducted by Tracy Hall and David Bowes – T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell. “A systematic literature review on fault prediction performance in software engineering”, Accepted for publication in TSE (download from BURA). • Located 208 relevant primary studies • Due to reporting requirements used 18 studies that contain 194 results – binary classifiers, confusion matrix, context details 7
  • 8. Matthews correlation coefficient 8 MCC Dataset$MCC frequency -0.2 0.0 0.2 0.4 0.6 0.8 0102030405060 -2 -1 0 1 2 -0.20.00.20.40.60.8 rnorm(194) Dataset$MCC TABLE IV COMPOSITE PERFORMANCE MEASURES Defined as Description detection) TP/ (TP + F N ) Proportion of faulty units cor TP/ (TP + F P) Proportion of units correctl faulty alse alarm) F P/ (F P + TN ) Proportion of non-faulty un classified TN/ (TN + F P) Proportion of correctly classi units 2·R ecal l ·P r eci si on R ecal l + P r eci si on Most commonly defined as mean of precision and recall ( T N + T P ) (T N + F N + F P + T P ) Proportion of correctly classifi on Coefficient T P ⇥T N − F P ⇥F Np (T P + F P )( T P + F N )(T N + F P )(T N + F N ) Combines all quadrants of th sion matrix to produce avalue to +1 with 0 indicating random tween the prediction and the r MCC can betested for statistic with χ2 = N · M CC2 where number of instances.
  • 10. ANOVA Results Factor % of var Author group 61% Metric family 3% Author/metric 9% Everything else 8% (but not significant) Residuals 19% 10
  • 11. Final word We cannot ignore the fact that the main determinant of a validation study result is which research group undertakes it. 11
  • 12. Know thy tools Stop treating data miners as black boxes. Looking inside is (1) fun, (2) easy, (3) needed. 12