4. The Premises of PROMISE
(2005)
– Wanted: predictions
• Nope. Users want decisions, or engagement
– Data mining will reveal “the truth” about SE
• [Dejaeger: TSE’11], [Hall: TSE’12], [Shepperd:COW’13]
• Not(Better learners = better conclusions)
– Sooner or later: enough data for general conclusions
• Found more differences than generalities
• Special issues: [IST’13], [ESEj’13]
• Best papers, ASE’11, MSR’12
• Menzies, Zimmermann et al [TSE’13]
• Lots of local models
6. Landscape mining:
look before you leap
• Report what is true about the data
– Not trivia on how algorithms walk that data
• Map the landscape
– Reason on each part of the map
• E.g. landscape mining
– Unsupervised iterative dichotomization
– Cluster, prune
– Then generate rules
• Different from “leap before you look”
– i.e. skew learning by the class variable
– then study the results
• E.g. C4.5, CART, Fayyad–Irani, etc.
– Supervised iterative dichotomization
• E.g. 61% of 300+ effort estimation papers
– Algorithm tinkering, without end
8. Spectral Landscape Mining
• Spectrum = a condition that is not limited to a specific set of values but varies over a continuum
• Groups a broad range of conditions or behaviors under one single title
• In mathematics, the spectrum of a (finite-dimensional) matrix is the set of its eigenvalues
• Nyström algorithms: approximations to eigenvalues
– FASTMAP: linear time
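The FASTMAP step can be sketched as follows. This is a minimal Python illustration, not the talk's actual implementation; the function names and the use of Euclidean distance are assumptions:

```python
import math
import random

def dist(p, q):
    # Euclidean distance between two equal-length tuples.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def fastmap_axis(points, seed=1):
    # FASTMAP heuristic: two linear passes find two distant "pivot" points,
    # then each point is placed on the line joining them via the cosine rule.
    random.seed(seed)
    anyone = random.choice(points)
    east = max(points, key=lambda p: dist(p, anyone))  # far from anyone
    west = max(points, key=lambda p: dist(p, east))    # far from east
    c = dist(east, west)
    def project(p):
        a, b = dist(p, east), dist(p, west)
        return (a * a + c * c - b * b) / (2 * c) if c else 0.0
    return [project(p) for p in points]
```

Each axis costs only a few linear passes over the data, which is why FASTMAP serves as a linear-time stand-in for an eigenvalue decomposition; repeating the procedure yields the second axis of a 2-D projection.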
9. Project data onto the first 2 PCA dimensions; grid that data
e.g. Nasa93dem
1) project 23 dimensions into 2
2a) cluster
2b) replace clusters with centroids
MOEA: score = effort + defects + months
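Steps 2a and 2b above might look like the following toy sketch, where cells of a grid over the 2-D projection stand in for clusters and each occupied cell is replaced by the mean of its rows (the bin count and names are illustrative, not from the talk):

```python
from collections import defaultdict

def grid_centroids(points2d, rows, bins=4):
    # Grid the 2-D projection; each occupied cell is one "cluster",
    # and its rows are replaced by a single centroid (column-wise mean).
    xs = [x for x, _ in points2d]
    ys = [y for _, y in points2d]
    def cell(v, lo, hi):
        # Map a coordinate to a bin index in [0, bins-1].
        return min(bins - 1, int(bins * (v - lo) / (hi - lo))) if hi > lo else 0
    cells = defaultdict(list)
    for (x, y), row in zip(points2d, rows):
        cells[cell(x, min(xs), max(xs)), cell(y, min(ys), max(ys))].append(row)
    return {k: [sum(col) / len(rs) for col in zip(*rs)]
            for k, rs in cells.items()}
```

The payoff is compression: hundreds of rows shrink to a handful of centroids, one per occupied grid cell, over which later reasoning (scoring, planning) runs.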
10. Sanity check:
what information loss?
• E.g. POI-3
– 400+ examples
– 20 centroids
• Prediction via:
– Extrapolation between the two nearest centroids
• Works as well as:
– Random forest, Naïve Bayes for defect prediction (10 data sets)
– Linear regression, M5’ for effort estimation (10 data sets)
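One way to read "extrapolation between the two nearest centroids" is inverse-distance interpolation, sketched below; the weighting scheme is an assumption, not necessarily what the cited experiments used:

```python
import math

def predict(query, centroids, targets):
    # Find the two centroids nearest the query and interpolate their
    # target values, weighting the closer centroid more heavily.
    i, j = sorted(range(len(centroids)),
                  key=lambda k: math.dist(query, centroids[k]))[:2]
    a, b = math.dist(query, centroids[i]), math.dist(query, centroids[j])
    if a + b == 0:
        return targets[i]  # query coincides with both centroids
    w = b / (a + b)        # inverse-distance weight for the nearer centroid
    return w * targets[i] + (1 - w) * targets[j]
```

For example, a query midway between two centroids with targets 0 and 10 predicts 5; the point of the sanity check is that something this simple, over only 20 centroids, matched standard learners trained on all 400+ rows.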
11. Planning = inter-cluster contrast sets
• Find the delta between neighbors that go from worse to better
• Very small rules, found in log-linear time
• Menzies et al. [TSE’13]
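An inter-cluster contrast set can be illustrated as the largest attribute deltas between a "worse" centroid and its "better" neighbor. A toy sketch, with attribute names and the top-k cut-off invented for illustration:

```python
def plan(worse, better, names, top=2):
    # Contrast set: the attribute deltas that would move a "worse" cluster's
    # centroid toward its "better" neighbor; keep only the largest changes.
    deltas = {n: b - w for n, w, b in zip(names, worse, better) if b != w}
    return dict(sorted(deltas.items(), key=lambda kv: -abs(kv[1]))[:top])
```

Keeping only the few largest deltas is what makes the resulting rules "very small": the plan names a handful of attribute changes rather than a full model.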
12. Applications
• Prediction
• Planning
• Monitoring
• Multi-objective optimization
– Cluster first on N objectives
• Anomaly detection
• Incremental theory revision
• Compression
• Privacy
• etc
13. Idea Engineering
0. algorithm mining (yesterday)
1. landscape mining (today)
2. decision mining (tomorrow)
3. discussion mining (future)
Beyond Data Mining, T. Menzies, IEEE Software, 2013, to appear
Q: why call it mining?
• A1: because all the primitives for the above are in the data mining literature
• So we know how to get from here to there
• A2: because data mining scales