ICSE’15 Technical Briefing:
The Art and Science of
Analyzing Software Data:
Quantitative Methods
Tim Menzies : NC State, USA
Leandro Minku : U. Birmingham, UK
Fayola Peters : Lero, UL, Ireland
http://unbox.org/open/trunk/doc/15/icse/techbrief
• Statistics
• Operations research
• Machine Learning
• Data mining
• Predictive Analytics
• Business Intelligence
• Data Science
• Big Data
• Smart Data
• ???
1
What’s
next?
2023 ?
2033 ?
Seek core principles that may
last for longer than just your next
application.
Who we are…
2
Tim Menzies
North Carolina State, USA
tim.menzies@gmail.com
Fayola Peters
LERO, University of Limerick, Ireland,
fayolapeters@gmail.com
Leandro L. Minku
The University of Birmingham, UK
L.L.Minku@cs.bham.ac.uk
Card-carrying
members of
“the PROMISE
project”
1. Introduction
2. Sharing data
3. Privacy and Sharing
4. Sharing models
5. Summary
3
1. Introduction
2. Sharing data
3. Privacy and Sharing
4. Sharing models
5. Summary
1a. Analyzing software: why?
1b. The PROMISE project
4
1a. Analyzing software: why?
• In the 21st century, too much data:
• Impossible to browse all available software projects
• E.g. the PROMISE repository of SE data
• has grown to 200+ standard projects
• 250,000+ spreadsheets
• And a dozen other open-source repositories:
• E.g. see next page
• E.g. as of Feb 2015:
• Mozilla Firefox : 1.1 million bug reports
• GitHub hosts 14+ million projects.
5
6
1a. Analyzing software: why?
1a. Analyzing software: why?
• Software engineering is so diverse;
• What works there does not here;
• Need cost effective methods for finding best local lessons;
• Every development team needs a team of data scientists.
7
• Research has deserted the individual and
entered the group. The individual worker
finds the problem too large, not too difficult.
(They) must learn to work with others.
• Theobald Smith
American pathologist and microbiologist
1859 -- 1934
• If you cannot, in the long run, tell everyone
what you have been doing, your doing has
been worthless.
• Erwin Schrödinger
Nobel Prize winner in physics
1887 -- 1961
8
1b. The PROMISE Project
If it works, try to make it better
• “The following is my valiant
attempt to capture the
difference (between PROMISE
and MSR)”
• “To misquote George Box, I
hope my model is more useful
than it is wrong:
• For the most part, the MSR
community was mostly
concerned with the initial
collection of data sets from
software projects.
• Meanwhile, the PROMISE
community emphasized the
analysis of the data after it was
collected.”
• “The PROMISE people
routinely posted all their data
on a public repository
• their new papers would re-
analyze old data, in an attempt
to improve that analysis.
• In fact, I used to joke “PROMISE.
Australian for repeatability”
(apologies to the Fosters
Brewing company). “
9
Dr. Prem Devanbu
UC Davis
General chair, MSR’14
1b. The PROMISE Project
The PROMISE repo
openscience.us/repo
#storingYourResearchData
• URL
• openscience.us/repo
• Data from 100s of projects
• E.g. EUSE:
• 250,000+ spreadsheets
• Oldest continuous
repository of SE data
• For other repos, see
Table 1 of goo.gl/UFZgnd
10
Serve all our data, on-line
1b. The PROMISE Project
• Initial, naïve, view:
• Collect enough data …
• … and the truth will emerge
• Reality:
• The more data we collected …
• … the more variance we observed
• It’s like the microscope zoomed in
• to smash the slide
• So now we routinely slice the data
• Find local lessons in local regions. 11
1b. The PROMISE Project
Challenges
12
Perspective on
Data Science
for Software
Engineering
Tim Menzies
Laurie Williams
Thomas
Zimmermann
2014 2015 2016
1b. The PROMISE Project
Our summary, and other related books
The MSR
community
and others
13
1b. The PROMISE Project
This briefing
Selected lessons from
“Sharing Data and Models”
1. Introduction
2. Sharing data
3. Privacy and Sharing
4. Sharing models
5. Summary
Step 1: Throw most of it away
Step 2: Share the rest
14
Transferring lessons learned:
Turkish Toasters to NASA Space Ships
15
Burak Turhan, Tim Menzies, Ayşe B. Bener, and Justin Di Stefano. 2009. On the relative value of cross-company and
within-company data for defect prediction. Empirical Softw. Engg. 14, 5 (October 2009),
Q: How to transfer data between projects?
A: Be very cruel to the data
• Ignore most of the data
• relevancy filtering: Turhan ESEj’09; Peters TSE’13, Peters
ICSE’15
• variance filtering: Kocaguneli TSE’12,TSE’13
• performance similarities: He ESEM’13
• Contort the data
• spectral learning (working in PCA
space or some other rotation)
Menzies, TSE’13; Nam, ICSE’13
• Build a bickering committee
• Ensembles Minku, PROMISE’12
16
Q: How to share data?
A: Carve most of it away
Column
pruning
• irrelevancy removal
• better predictions
Row
pruning
• outliers,
• privacy,
• anomaly detection,
incremental learning,
• handling missing
values,
• cross-company
learning
• noise reduction
Range
pruning
• explanation
• optimization
17
Data mining = data carving
Michelangelo
• Every block of stone has a
statue inside it and it is the
task of the sculptor to
discover it.
Someone else
• Some databases have
models inside, and it is the
task of the data scientist
to go look.
18
Data mining = Data Carving
• How to mine:
1. Find the cr*p
2. Cut the cr*p;
3. Goto step1
19
• Eg. Discretization
• Numerics divided
• where class frequencies most change
• If no division,
• then no information in that attribute
• E.g. Classes = (notDiabetic, isDiabetic)
• Baseline distribution = (5: 3)
[Figure: class mass distributions, raw vs. discretized at the point where class frequencies most change]
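The discretization rule above (cut a numeric where class frequencies most change; if no cut gains anything, the attribute carries no information) can be sketched as a one-attribute, entropy-based splitter. The glucose-like values and class names below are illustrative, not from PROMISE data:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(l) / n) * math.log2(labels.count(l) / n)
                for l in set(labels))

def best_cut(values, labels):
    """Return the numeric cut where class frequencies change the most
    (largest entropy gain), or None if no split gains information,
    in which case this attribute carries no signal for these classes."""
    pairs = sorted(zip(values, labels))
    ordered = [l for _, l in pairs]
    base, n = entropy(ordered), len(pairs)
    cut, best_gain = None, 0.0
    for i in range(1, n):
        gain = base - (i * entropy(ordered[:i]) +
                       (n - i) * entropy(ordered[i:])) / n
        if gain > best_gain:
            cut, best_gain = (pairs[i - 1][0] + pairs[i][0]) / 2, gain
    return cut

# A numeric attribute that cleanly separates the two classes:
cut = best_cut([1, 2, 3, 10, 11, 12],
               ["notDiabetic"] * 3 + ["isDiabetic"] * 3)
```

Here the best cut lands midway between the two class clumps (at 6.5); an attribute whose values mix the classes evenly would yield little or no gain.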
BTW, works for rows
as well as columns
• Models are reported from repeated signals,
• So R rows of data must contain repeats
• Otherwise, no model
• Replace all repeats with one exemplar
• Cluster data
• Replace each cluster with
its middle point
20
e.g.
Before: 322 rows * 24 columns
After : 21 clusters * 5 columns
For defect prediction, no information loss
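The row-pruning step above (cluster, then keep one exemplar per cluster) can be sketched with a tiny k-means loop. The six rows and two seeds are illustrative; any clusterer that replaces repeats with a middle point fits the slide's recipe:

```python
def centroid(rows):
    """Mean point of a cluster of equal-length numeric rows."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))

def reduce_rows(rows, seeds, iters=10):
    """Tiny k-means: group rows around the seeds, then replace each
    cluster with its middle point, one exemplar per repeated signal."""
    centers = list(seeds)
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for row in rows:
            nearest = min(range(len(centers)),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(row, centers[i])))
            clusters[nearest].append(row)
        centers = [centroid(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# Six rows that are really two repeated "shapes" collapse to two exemplars:
data = [(1, 1), (1.1, 0.9), (0.9, 1.1), (9, 9), (9.1, 8.9), (8.9, 9.1)]
exemplars = reduce_rows(data, seeds=[(0, 0), (10, 10)])
```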
And What About Range Pruning?
• Classes x,y
• Fx, Fy
• frequency of discretized
ranges in x,y
• Log Odds Ratio
• log(Fx/Fy )
• Is zero if no difference in
x,y
• E.g. Data from Norman Fenton’s
Bayes nets discussing software
defects = yes, no
• Do most ranges contribute to
determination of defects?
• Restrict discussion to just most
powerful ranges
21
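The log odds ratio above takes a couple of lines. The range names and their frequencies in the two classes below are made up for illustration; the point is that most ranges score near zero and only a few are worth keeping:

```python
import math

def log_odds(freq_x, freq_y):
    """log(Fx / Fy): zero when a discretized range is equally frequent
    in classes x and y, large in magnitude when it favours one class."""
    return math.log(freq_x / freq_y)

# Hypothetical range frequencies in defects=yes vs. defects=no rows:
ranges = {"loc>100": (0.40, 0.10),
          "cc>10": (0.30, 0.25),
          "comments<5%": (0.20, 0.20)}

# Restrict discussion to just the most powerful ranges:
powerful = {name for name, (fx, fy) in ranges.items()
            if abs(log_odds(fx, fy)) > 0.5}
```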
Learning from
“powerful” ranges
Explanation
• Generate tiny models
• Sort all ranges by their power
• WHICH
1. Select any pair (favoring those with most
power)
2. Combine pair, compute its power
3. Sort back into the ranges
4. Goto 1
• Initially:
• stack contains single ranges
• Subsequently
• stack sets of ranges
Tim Menzies, Zach Milton, Burak Turhan, Bojan Cukic, Yue Jiang, Ayse
Basar Bener: Defect prediction from static code features: current results,
limitations, new approaches. Autom. Softw. Eng. 17(4): 375-407 (2010)
[Figure: a decision tree learned from 14 features, next to the far smaller rule found by WHICH]
22
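The WHICH loop above can be sketched as follows. Note one simplification: the published WHICH picks pairs stochastically, favouring high scores, while this sketch deterministically combines the two best. The per-range powers and the size-discounted score are illustrative:

```python
def which(ranges, score, rounds=10):
    """Deterministic sketch of the WHICH loop: keep candidate rules
    (sets of ranges) sorted by score, combine the two best, and sort
    the combination back into the stack."""
    stack = sorted(([r] for r in ranges), key=score, reverse=True)
    for _ in range(rounds):
        combined = sorted(set(stack[0] + stack[1]))
        if combined in stack:
            break                      # nothing new to try
        stack.append(combined)
        stack.sort(key=score, reverse=True)
    return stack[0]

# Illustrative per-range powers; score = total power, discounted by rule size:
power = {"a": 3.0, "b": 2.0, "c": 1.0}
best = which(power, lambda rule: sum(power[r] for r in rule) / len(rule) ** 0.5)
```

Starting from single ranges, the stack soon holds sets of ranges; the loop stops when combining the leaders produces nothing new.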
Skip re-entry
• My optimizers vs state of the art numeric optimizers
• My tools ran 40 times faster
• Generated better solutions
• Powerful succinct explanation tool
23
Automatically Finding the Control Variables for Complex System Behavior Gregory Gay, Tim Menzies, Misty
Davies, and Karen Gundy-Burlet Journal - Automated Software Engineering, 2010 [PDF]
We prune, and the model still works?
So why so few key variables?
• Because otherwise, no model
• Models = summaries of repeated similar structures in data
• No examples of that structure? Then no model
• Volume of the n-dimensional sphere: Vn = Vn-2 · 2πr² / n
• For r = 1, Vn shrinks once n > 2π (i.e., for n ≥ 7)
• So as complexity grows
• Space for similar things shrinks
• Models are either low
dimensional
• Or not supportable (no data)
24
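The shrinking-sphere argument above is easy to check numerically with the standard recurrence for the unit n-ball, Vn = Vn-2 · 2π/n (with V0 = 1, V1 = 2):

```python
import math

def unit_sphere_volume(n):
    """Volume of the unit n-ball via V_n = V_{n-2} * 2*pi / n."""
    v = {0: 1.0, 1: 2.0}
    for k in range(2, n + 1):
        v[k] = v[k - 2] * 2 * math.pi / k
    return v[n]

vols = [unit_sphere_volume(n) for n in range(1, 21)]
peak = vols.index(max(vols)) + 1   # dimension with the largest volume
```

The volume peaks in five dimensions and then collapses toward zero, which is the slide's point: as complexity grows, the space where similar things can sit shrinks, so supportable models are low dimensional.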
Applications of pruning
Anomaly detection
• Pass around the reduced
data set
• “Alien”: new data is too “far
away” from the reduced
data
• “Too far”: e.g. 10% of the
separation of the most
distant pair
Incremental learning
• Pass around the
reduced data set
• Add if anomalous:
• For defect data, cache
does not grow beyond
3% of total data
• E.g. LACE2, Peters,
ICSE15
Missing values
• For effort
estimation
– Reasoning by analogy
on all data with missing
“lines of code”
measures
– Hurts estimation
• But after row pruning
(using a reverse nearest
neighbor technique)
– Good estimates, even
without size
– Why? Other features
“stand in” for the
missing size features 25
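The anomaly-detection column above reduces to a few lines: keep the pruned cache, and flag new data that lies farther than 10% of the cache's separation from everything cached. The cache points below are illustrative:

```python
def separation(cache, dist):
    """Distance between the two farthest-apart rows in the cache."""
    return max(dist(a, b) for a in cache for b in cache)

def is_alien(new, cache, dist):
    """Flag `new` as anomalous when it lies farther than 10% of the
    cache's separation from every cached row (the slide's rule)."""
    threshold = 0.1 * separation(cache, dist)
    return all(dist(new, old) > threshold for old in cache)

def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# A small reduced cache; (5, 5) is "too far" from everything in it:
cache = [(0, 0), (1, 0), (0, 1), (10, 10)]
```

The incremental-learning column is the same test run in reverse: only the aliens get added to the passed cache, which is why the cache stays small.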
Other applications of pruning
Noise reduction
• Hierarchical clustering
• Throw away sub-trees with
highest variance
• Cluster again
• TEAK, IEEE TSE 2012,
• Exploiting the Essential
Assumptions of Analogy-
Based Effort Estimation
Cross-company learning
• Don’t learn from all
data
• Just from training data
in same cluster
• Works even when
data comes from
multiple companies
• EMSE journal, 2009,
relative value of cross-
company and within-
company data
Explanation
• Just show samples in
the cluster nearest
user’s concerns
• Or, list all clusters by
their average
properties and say
“you are here, your
competitors are there.”
26
But Why Prune at All?
Why not use all the data?
The original vision
of PROMISE
• With enough data, our
knowledge will stabilize
• But the more data we
collected …
• … the more variance we
observed
• It’s like the microscope
zoomed in
• to smash the slide
Software projects
are different
• They change from place to
place.
• They change from time to
time.
• My lessons may not apply
to you
• Your lessons may not even
apply to you (tomorrow).
• Locality, locality, locality 27
Example conclusion instability
Are all these studies wrong?
28
The uncarved block
Michelangelo
• Every block of stone has a
statue inside it and it is the
task of the sculptor to
discover it.
Someone else
• Some databases have
models inside, and it is the
task of the data scientist
to go look.
29
1. Introduction
2. Sharing data
3. Privacy and Sharing
4. Sharing models
5. Summary
Step 1: Throw most of it away
Step 2: Share the rest
30
Why We Care
• Sebastian Elbaum et al.
2014
Sharing industrial datasets with
the research community is
extremely valuable, but also
extremely challenging as it
needs to balance the usefulness
of the dataset with the
industry’s concerns for privacy
and competition.
31
S. Elbaum, A. Mclaughlin, and J. Penix, “The google dataset of testing results,” june 2014. [Online].
Available: https://code.google.com/p/google-shared-dataset-of-test-suite-results
Consider NASA Contractors
• NASA’s software
contractors
• Subject to competitive
bidding every 2 years,
• Unwilling to share data
that would lead to
sensitive attribute
disclosure
• e.g. actual software
development times
32
Sensitive Attribute Disclosure
• A privacy threat.
• Occurs when a target is
associated with
information about their
sensitive attributes
• e.g. software code complexity
or actual software
development times.
33
B. C. M. Fung, R. Chen, and P. S. Yu, “Privacy-Preserving Data Publishing: A Survey on Recent Developments,” Computing,
vol. V, no. 4, pp. 1–53, 2010.
J. Brickell and V. Shmatikov, “The cost of privacy: destruction of data-mining utility in anonymized data publishing,” in
Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, ser. KDD ’08.
Software Defect Prediction
34
• For improving inspection efficiency
• But wait! I don’t have enough data.
• Local data not always available
[Zimmermann et al. 2009]
• companies too small;
• product in first release, no past
data;
• no time for data collection;
• new technology can make all data
irrelevant. [Kitchenham et al. 2007]
T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, “Cross-project defect prediction: a large scale experiment on data
vs. domain vs. process.” in ESEC/SIGSOFT FSE’09, 2009, pp. 91–100.
Kitchenham, Barbara A., Emilia Mendes, and Guilherme H. Travassos. "Cross versus within-company cost estimation studies: A
systematic review." Software Engineering, IEEE Transactions on 33.5 (2007): 316-329
Cross Project Defect Prediction
35
• Use of data from other
sources to build defect
predictors for target data.
• Initial results (Zimmermann et
al. 2009).
644 cross-project defect prediction experiments:
strong predictors in only 3.4%; weak in 96.6%
T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, “Cross-project defect prediction: a large scale experiment on data
vs. domain vs. process.” in ESEC/SIGSOFT FSE’09, 2009, pp. 91–100.
Cross Project Defect Prediction
• Reason for initial results: Data distribution between source
data and target data are different. [Nam et al. 2013]
• Other results have more promising outcome (Turhan et al.
2009, He et al. 2012,2013, Nam et al. 2013).
• Use of data from other sources to build defect predictors for
target data.
• This raises privacy concerns
36
J. Nam, S. J. Pan, and S. Kim, “Transfer defect learning,” in ICSE’13. IEEE Press Piscataway, NJ, USA, 2013, pp. 802–811.
B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, “On the relative value of cross-company and within-company data for defect
prediction,” Empirical Software Engineering, vol. 14, pp. 540–578, 2009.
He, Zhimin, et al. "An investigation on the feasibility of cross-project defect prediction." Automated Software Engineering 19.2
(2012): 167-199.
He, Zhimin, et al. "Learning from open-source projects: An empirical study on defect prediction." Empirical Software Engineering
and Measurement, 2013 ACM/IEEE International Symposium on. IEEE, 2013.
What We Want
• By using a privacy framework such as LACE2, you will be able to share an
obfuscated version of your data while having a high level of privacy and
maintaining the usefulness of the data.
• Intuition for LACE2: Software code reuse.
• Don’t share what others have shared.
• In a set of programs, 32% consisted of reused code (not including
libraries). [Selby 2005] 37
Features Algorithm
Privacy Low sensitive attribute disclosure. ?
Utility Strong defect predictors. ?
Cost
Low memory requirements. ?
Fast runtime. ?
R. Selby, “Enabling reuse-based software development of large-scale systems,” Software Engineering, IEEE Transactions
on, vol. 31, no. 6, pp. 495–510, June 2005.
LACE2: Data Minimization
38
CLIFF: "a=r1" is powerful for
selection for class=yes, i.e. more
common in "yes" than "no".
• P(yes|r1) =
like(yes|r1)² / ( like(yes|r1) + like(no|r1) )
• Step 1: For each class find
ranks of all values;
• Step 2: Multiply ranks of each
row;
• Step 3: Select the most
powerful rows of each class.
F. Peters and T. Menzies, “Privacy and utility for defect prediction: Experiments with morph,” in Proceedings of the 2012 International
Conference on Software Engineering, ser. ICSE 2012. Piscataway, NJ, USA: IEEE Press, 2012, pp. 189–199.
F. Peters, T. Menzies, L. Gong, and H. Zhang, “Balancing privacy and utility in cross-company defect prediction,” Software Engineering, IEEE
Transactions on, vol. 39, no. 8, pp. 1054–1068, Aug 2013.
a b c d class
r1 r1 r1 r2 yes
r1 r2 r3 r2 yes
r1 r3 r3 r3 yes
r4 r4 r4 r4 no
r1 r5 r5 r2 no
r6 r6 r6 r2 no
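CLIFF's three steps can be sketched on the toy table above. This is a simplification of the published algorithm (real CLIFF ranks values per column and keeps a chosen percentage of rows; here a value is scored wherever it appears in a row, and one row per class survives):

```python
def cliff(rows, classes, keep=1):
    """Sketch of CLIFF (two classes): score each value by
    like(cls|v)^2 / (like(cls|v) + like(other|v)), multiply a row's
    value scores, keep the highest-scoring rows of each class."""
    labels = sorted(set(classes))

    def like(value, cls):
        inside = [r for r, c in zip(rows, classes) if c == cls]
        return sum(value in r for r in inside) / len(inside)

    kept = []
    for cls in labels:
        other = [l for l in labels if l != cls][0]

        def row_power(row):
            p = 1.0
            for v in row:
                a, b = like(v, cls), like(v, other)
                p *= a * a / (a + b) if a + b else 0.0
            return p

        mine = [(row_power(r), i) for i, (r, c) in
                enumerate(zip(rows, classes)) if c == cls]
        kept += [i for _, i in sorted(mine, reverse=True)[:keep]]
    return sorted(kept)

# The slide's toy table: four attributes a..d, values r1..r6:
rows = [("r1", "r1", "r1", "r2"), ("r1", "r2", "r3", "r2"),
        ("r1", "r3", "r3", "r3"), ("r4", "r4", "r4", "r4"),
        ("r1", "r5", "r5", "r2"), ("r6", "r6", "r6", "r2")]
classes = ["yes", "yes", "yes", "no", "no", "no"]
survivors = cliff(rows, classes)
```

On this table the strongest "yes" row is the third one (r1 and two r3 values, both of which favour "yes"); only the survivors go on to be MORPHed.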
LACE2: Obfuscation
39
MORPH: Mutate the survivors no
more than half the distance to
their nearest unlike neighbor.
• x is original instance;
• z is nearest unlike neighbor of
x;
• y resulting MORPHed instance;
• r is random.
F. Peters and T. Menzies, “Privacy and utility for defect prediction: Experiments with morph,” in Proceedings of the 2012 International
Conference on Software Engineering, ser. ICSE 2012. Piscataway, NJ, USA: IEEE Press, 2012, pp. 189–199.
F. Peters, T. Menzies, L. Gong, and H. Zhang, “Balancing privacy and utility in cross-company defect prediction,” Software Engineering, IEEE
Transactions on, vol. 39, no. 8, pp. 1054–1068, Aug 2013.
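A minimal MORPH sketch, following the slide's rule: move each numeric value of a survivor by a random fraction of its distance to the nearest unlike neighbor, staying well under half that distance (the 0.15..0.35 fraction band follows the settings reported in the MORPH work; this sketch always pushes away from z, while the paper allows either direction):

```python
import random

def morph(x, z, rng=random.Random(7)):
    """Mutate survivor x no more than half the distance to z, its
    nearest unlike neighbor: y_i = x_i + (x_i - z_i) * r."""
    return tuple(xi + (xi - zi) * rng.uniform(0.15, 0.35)
                 for xi, zi in zip(x, z))

x = (2.0, 4.0)      # survivor instance (illustrative values)
z = (3.0, 8.0)      # its nearest unlike neighbor
y = morph(x, z)
```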
LACE2: Group Sharing
40
• Intuition for LACE2: Software code reuse.
• Don’t share what others have shared.
• In a set of programs, 32% consisted of reused code (not including libraries).
[Selby 2005]
• LACE2 : Learn from N software projects
• from multiple data owners
• As you learn, play “pass the parcel”
• The cache of reduced data
• Each data owner only adds its “leaders” to the passed cache
• Morphing as they go
• Each data owner determines “leaders” according to distance
• separation = distance between the 2 farthest instances
• leader threshold d = separation / 10
Duda, Richard O., Peter E. Hart, and David G. Stork. Pattern classification. John Wiley & Sons, 2012.
R. Selby, “Enabling reuse-based software development of large-scale systems,” Software Engineering, IEEE Transactions on, vol. 31, no. 6,
pp. 495–510, June 2005.
LACE2: Sensitive Attribute
Disclosure
• Occurs when a target is associated with information about
their sensitive attributes, (e.g. software code complexity).
• Measured as Increased Privacy Ratio (IPR)
• 100 % = zero sensitive attribute disclosure
• 0% = total sensitive attribute disclosure
41
F. Peters and T. Menzies, “Privacy and utility for defect prediction: Experiments with morph,” in Proceedings of the 2012 International
Conference on Software Engineering, ser. ICSE 2012. Piscataway, NJ, USA: IEEE Press, 2012, pp. 189–199.
F. Peters, T. Menzies, L. Gong, and H. Zhang, “Balancing privacy and utility in cross-company defect prediction,” Software Engineering, IEEE
Transactions on, vol. 39, no. 8, pp. 1054–1068, Aug 2013.
Queries Original Obfuscated Breach
Q1 0 0 yes
Q2 0 1 no
Q3 1 1 yes
no=1/3
IPR=33%
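The IPR computation above, applied to the slide's three queries (an attacker's query "breaches" when the obfuscated data still returns the original answer):

```python
def ipr(original, obfuscated):
    """Increased Privacy Ratio: % of attacker queries whose answer on
    the obfuscated data no longer matches the original. 100% = zero
    sensitive attribute disclosure; 0% = total disclosure."""
    breaches = sum(o == p for o, p in zip(original, obfuscated))
    return 100.0 * (1.0 - breaches / len(original))

# The slide's example: Q1 and Q3 still breach, Q2 does not:
privacy = ipr([0, 0, 1], [0, 1, 1])
```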
Data
42
Results: Privacy IPRs
43
RQ1: Does LACE2 offer more privacy than LACE1?
[Chart: IPRs (%) for LACE1 and LACE2 across the 7 proprietary data sets]
• Median IPRs over 10 runs.
• The higher the better.
• 100% = zero sensitive attribute disclosure
• 0% = total sensitive attribute disclosure
Results: Privacy IPRs
44
RQ1: Does LACE2 offer more privacy than LACE1?
[Chart repeated: IPRs for LACE1 and LACE2 across the 7 proprietary data sets]
Result Summary
Features Algorithm
Privacy Low sensitive attribute disclosure. yes
Utility Strong defect predictors. ?
Cost
Low memory requirements*. ?
Fast runtime. ?
45
* Don’t share what others have shared.
Performance Measures
• TP (True Positive): defect-
prone classes that are
classified correctly;
• FN (False Negative): defect-
prone classes that are
wrongly classified to be
defect-free;
• TN (True Negative): defect-
free classes that are classified
correctly;
• FP (False Positive): defect-
free classes that are wrongly
classified to be defect-prone.
46
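From the four confusion-matrix counts above come the two measures the following result slides use: pd (probability of detection) and pf (probability of false alarm), under their standard definitions. The counts in the example are illustrative:

```python
def pd(tp, fn):
    """Probability of detection (recall): % of defect-prone classes
    the predictor actually catches: TP / (TP + FN)."""
    return 100.0 * tp / (tp + fn)

def pf(fp, tn):
    """Probability of false alarm: % of defect-free classes the
    predictor wrongly flags as defect-prone: FP / (FP + TN)."""
    return 100.0 * fp / (fp + tn)

# e.g. 40 of 50 defective classes caught; 20 of 100 clean ones flagged:
detection, false_alarm = pd(40, 10), pf(20, 80)
```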
Results: Defect Prediction
• Median pds relatively
higher for LACE2 for
6/10 data sets
• Five local pd results
are less than 50%
• For ant-1.7, camel-1.6,
ivy-2.0, jEdit-4.1 and
xerces-1.3.
47
RQ2: Does LACE2 offer more useful defect predictors than LACE1 and local?
[Chart: pds (%) for local and LACE2 across the test defect data sets]
Results: Defect Prediction
48
RQ2: Does LACE2 offer more useful defect predictors than LACE1 and local?
[Chart: pds (%) for LACE1 and LACE2 across the test defect data sets]
Results: Defect Prediction
• Consequence of high pds for LACE2
• Higher pfs (lower is best) than local and LACE1.
49
Pfs for local, LACE1 and LACE2
Data local LACE1 LACE2
jEdit-4.1 5.7 23.4 41.7
ivy-2.0 6.9 31.9 46.3
xerces-1.3 8.0 27.1 33.7
ant-1.7 8.4 34.3 36.8
camel-1.6 11.2 28.2 37.6
lucene-2.4 16.2 24.0 31.1
xalan-2.6 16.2 28.1 27.3
velocity-1.6.1 19.1 22.7 30.3
synapse-1.2 21.2 40.2 55.7
poi-3.0 23.6 16.4 23.8
Results: Defect Prediction
• Consequence of high pds for LACE2
• Increasing pfs (lower is best)
50
Pfs for local, LACE1 and LACE2
[Table repeated from the previous slide]
Result Summary
Features Algorithm
Privacy Low sensitive attribute disclosure. yes
Utility Strong defect predictors. yes
Cost
Low memory requirements*. ?
Fast runtime. ?
51
* Don’t share what others have shared.
Results: Memory
52
RQ3: Are system costs of LACE2 (memory) worse than LACE1?
[Chart: % of data in the private cache for LACE1 and LACE2 across the proprietary data sets]
Result Summary
Features Algorithm
Privacy Low sensitive attribute disclosure. yes
Utility Strong defect predictors. yes
Cost
Low memory requirements*. yes
Fast runtime. ?
53
* Don’t share what others have shared.
Results: Runtime
54
RQ3: Are system costs of LACE2 (runtime) worse than LACE1?
Runtime cost for LACE1 and LACE2:
• LACE1: 2205 seconds
• LACE2: 2059 seconds
Result Summary
Features Algorithm
Privacy Low sensitive attribute disclosure. yes
Utility Strong defect predictors. yes
Cost
Low memory requirements. yes
Fast runtime. yes
55
• LACE2 provides more privacy than LACE1.
• Less data used.
• No loss of predictive efficacy due to the sharing method of LACE2.
• Don’t share what others have shared.
• LACE2’s sharing method does not take more resources than LACE1.
• By using LACE2, you will be able to share an obfuscated
version of your data while having a high level of privacy and
maintaining the usefulness of the data.
1. Introduction
2. Sharing data
3. Privacy and Sharing
4. Sharing models
5. Summary
4a. Bagging
4b. Comba
4c. DCL
4d. Multi-objective ensembles
56
Ensembles
Artificially generated experts, possibly with slightly different
views on how to solve a problem.
57
Ensembles
Sets of learning machines grouped together with the aim of
improving predictive performance.
58
...
estimation1 estimation2 estimationN
Base learners
E.g.: ensemble estimation = Σ wi estimationi
B1 B2 BN
T. Dietterich. Ensemble Methods in Machine Learning. Proceedings of the First International
Workshop in Multiple Classifier Systems. 2000.
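The slide's combination rule, ensemble estimation = Σ wi · estimationi, in a couple of lines; the three base estimates and weights below are illustrative:

```python
def ensemble_estimate(estimates, weights):
    """Combine base learners' estimates as sum_i w_i * estimate_i,
    normalising the weights so they sum to one."""
    total = sum(weights)
    return sum(w / total * e for w, e in zip(weights, estimates))

# Three base learners' effort estimates (illustrative person-hours):
est = ensemble_estimate([1000.0, 1200.0, 1700.0], [0.5, 0.3, 0.2])
```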
Ensemble Diversity
One of the keys: diversity, i.e., different base learners make
different mistakes on the same instances.
59
Ensemble Versatility
Diversity can be used to address different issues when
estimating software data.
60
Models of
the same
environment
Models with
different
goals
Models of
different
environments
Models of
different
environments
Ensemble Versatility
Diversity can be used to increase stability across data sets.
61
Models of
the same
environment
Models with
different
goals
Models of
different
environments
Bagging Ensembles of Regression
Trees
62
L. Breiman. Bagging Predictors. Machine Learning 24(2):123-140, 1996.
Training data
(completed projects)
Ensemble
RT1 RT2 RTN...
Sample
uniformly with
replacement
Example RT:
• Functional Size >= 253: Effort = 5376
• Functional Size < 253:
• Functional Size < 151: Effort = 1086
• Functional Size >= 151: Effort = 2798
Regression
Trees (RTs)
Regression Trees (RTs):
 Local methods.
 Divide projects
according to attribute
value.
 Most impactful
attributes are in higher
levels.
 Attributes with
insignificant impact are
not used.
 E.g., REPTrees.
WEKA
 Weka: classifiers – meta – bagging
 classifiers – trees – REPTree
63
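Beyond the WEKA configuration above, the bagging recipe itself (sample uniformly with replacement, train one tree per sample, average the estimates) fits in a short pure-Python sketch. A one-split stump stands in for a full regression tree, and the (size, effort) training projects are illustrative:

```python
import random

def bagging(train, learner, n=10, rng=random.Random(1)):
    """Train n base models on bootstrap samples (drawn uniformly with
    replacement) and return a predictor averaging their estimates."""
    models = [learner([rng.choice(train) for _ in train])
              for _ in range(n)]
    return lambda x: sum(m(x) for m in models) / n

def stump(sample):
    """One-split regression stump on functional size: predict the
    mean effort of the low or high side of the median size."""
    cut = sorted(s for s, _ in sample)[len(sample) // 2]
    lo = [e for s, e in sample if s < cut] or [0.0]
    hi = [e for s, e in sample if s >= cut] or [0.0]
    return lambda x: (sum(lo) / len(lo)) if x < cut else (sum(hi) / len(hi))

# Illustrative completed projects as (functional size, effort):
train = [(100, 1000), (120, 1100), (150, 1200), (260, 5000), (300, 5500)]
predict = bagging(train, stump)
```

Each bootstrap sample yields a slightly different stump, so the averaged ensemble is more stable than any single tree.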
Increasing Performance Rank
Stability Across Data Sets
 Study with 13 data sets from PROMISE and ISBSG
repositories.
 Bag+RTs:
 Obtained the highest rank across data set in terms of
Mean Absolute Error (MAE).
 Rarely performed considerably worse (by more than 0.1 in SA,
where SA = 1 – MAE / MAErguess) than the best approach:
64
L. Minku, X. Yao. Ensembles and Locality: Insight on Improving Software Effort Estimation. Information and
Software Technology 55(8):1512-1528, 2013.
Comba
65
Kocaguneli, E., Menzies, T. and Keung, J. On the Value of Ensemble Effort Estimation. IEEE
Transactions on Software Engineering, 8(6):1403 – 1416, 2012.
Solo-methods: preprocessing + learning algorithm
Training data
(completed projects)
Ensemble
[Diagram: solo-methods S1..SN, each a preprocessor plus learner, trained on the data and combined]
Rank solo-methods based
on win, loss, win-loss
Select top ranked models with few rank
changes
And sort according to losses
Comba
Experimenting with:
90 solo-methods, 20 public data sets, 7 error measures
66
Kocaguneli, E., Menzies, T. and Keung, J. On the Value of Ensemble Effort Estimation. IEEE
Transactions on Software Engineering, 8(6):1403 – 1416, 2012.
Increasing Rank Stability
Across Data Sets
67
Combine top 2,4,8,13 solo-methods
via mean, median and IRWM
Re-rank solo and multi-methods
together according to #losses
The first ranked multi-method had very low rank-changes.
Ensemble Versatility
Diversity can be used to increase performance on different
measures.
68
Models of
the same
environment
Models with
different
goals
Models of
different
environments
Models of
different
environments
Multi-Objective Ensemble
• There are different measures/metrics of performance
for evaluating SEE models.
• E.g.: MAE, standard deviation, PRED, etc.
• Different measures capture different quality features.
69
• There is no agreed
single measure.
• A model doing well
for a certain measure
may not do so well
for another.
Multi-Objective Ensembles
 We can view SEE as a multi-objective learning problem.
 A multi-objective approach (e.g. Multi-Objective
Evolutionary Algorithm (MOEA)) can be used to:
 Better understand the relationship among measures.
 Create ensembles that do well for a set of measures, in
particular for larger data sets (>=60).
70
L. Minku, X. Yao. Software Effort Estimation as a Multi-objective Learning Problem. ACM
Transactions on Software Engineering and Methodology, 22(4):35, 2013.
Multi-Objective Ensembles
71
Training data
(completed projects)
Ensemble
B1 B2 B3
Multi-objective evolutionary
algorithm creates nondominated
models with several different trade-
offs.
The model with the best performance
in terms of each particular measure
can be picked to form an ensemble
with a good trade-off.
L. Minku, X. Yao. Software Effort Estimation as a Multi-objective Learning Problem. ACM
Transactions on Software Engineering and Methodology, 22(4):35, 2013.
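The picking step above, keep the nondominated models and take the best front member per measure, can be sketched as follows (the (MAE, standard deviation) score vectors are illustrative; lower is better):

```python
def pareto_front(scores):
    """Indices of nondominated score vectors: no other vector is at
    least as good on every measure and strictly better on one."""
    def dominates(a, b):
        return (all(x <= y for x, y in zip(a, b)) and
                any(x < y for x, y in zip(a, b)))
    return [i for i, s in enumerate(scores)
            if not any(dominates(t, s)
                       for j, t in enumerate(scores) if j != i)]

def pick_ensemble(scores, front):
    """One ensemble member per measure: the front model best on it."""
    return [min(front, key=lambda i: scores[i][k])
            for k in range(len(scores[0]))]

# (MAE, standard deviation) for three candidate models:
scores = [(1.0, 3.0), (2.0, 1.0), (3.0, 3.0)]
front = pareto_front(scores)          # model 2 is dominated by model 0
members = pick_ensemble(scores, front)
```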
Improving Performance on
Different Measures
 Sample result: Pareto ensemble of MLPs (ISBSG):
 Important:
Using performance measures that behave differently from
each other (low correlation) provides better results than using
performance measures that are highly correlated.
More diversity.
This can even improve results in terms of measures not
used for training.
72
L. Minku, X. Yao. An Analysis of Multi-objective Evolutionary Algorithms for Training Ensemble Models Based
on Different Performance Measures in Software Effort Estimation. PROMISE, 10p, 2013.
Ensemble Versatility
Diversity can be used to deal with changes and transfer
knowledge.
73
Models of
the same
environment
Models with
different
goals
Models of
different
environments
Models of
different
environments
Companies’ Changing
Environments
Companies are not
static entities – they
can change with time
(concept drift).
• Companies can start
behaving more or less
similarly to other
companies.
74
Predicting effort for a single company from ISBSG based
on its projects and other companies' projects.
How to know when a
model from another
company is helpful?
How to improve
performance
throughout time?
Dynamic Cross-Company
Learning (DCL)
75
WC Model
Within-company (WC)
incoming training
data (completed
projects arriving with
time)
CC Model
1
CC Model
2
CC Model
M...
w
DCL learns a weight to reflect the suitability of CC models.
For each new training
project
• If model is not a
winner, multiply its
weight by β (0 < β < 1)
L. Minku, X. Yao. Can Cross-company Data Improve Performance in Software Effort Estimation?
PROMISE, p. 69-78, 2012.
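The weight-update rule above is a sketch away from code. This simplification lets the single highest-weighted model answer; the published DCL uses the weights to judge the suitability of the CC models over time. The errors and β below are illustrative:

```python
def dcl_update(weights, errors, beta=0.5):
    """After each completed WC project, multiply the weight of every
    model that was not a winner (lowest error on that project) by
    beta, with 0 < beta < 1."""
    best = min(errors)
    return [w * (1.0 if e == best else beta)
            for w, e in zip(weights, errors)]

def dcl_predict(wc_est, cc_ests, weights):
    """Let the currently highest-weighted model's estimate win."""
    ests = [wc_est] + cc_ests
    return ests[max(range(len(ests)), key=lambda i: weights[i])]

weights = [1.0, 1.0, 1.0]            # WC model plus two CC models
weights = dcl_update(weights, errors=[300.0, 50.0, 400.0])
```

After one project, the first CC model (error 50) keeps its weight while the others are halved, so its estimates now win.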
Improving Performance
Throughout Time
• DCL adapts to changes by using CC models.
• DCL manages to use CC models to improve performance over
WC models.
76
Predicting effort for a single company from ISBSG based on its projects and other companies' projects.
Sample Result
Dynamic Cross-Company Mapped
Model Learning (Dycom)
77
WC Model
Within-company (WC)
incoming training
data (completed
projects arriving with
time)
CC
Model
1
CC
Model
2
CC
Model
M...
w1 w2 wM
w
How to use CC models even when they are not directly helpful?
Dycom learns functions
to map CC models to
the WC context.
L. Minku, X. Yao. How to Make Best Use of Cross-Company Data in Software Effort Estimation?
ICSE, p. 446-456, 2014.
Learning Mapping Function
78
where lr is a smoothing factor that allows tuning the emphasis on more
recent examples.
L. Minku, X. Yao. How to Make Best Use of Cross-Company Data in Software Effort Estimation?
ICSE, p. 446-456, 2014.
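An illustrative sketch of the mapping idea, not the paper's exact formula: keep a multiplicative factor b that maps a CC model's estimate into the WC context, and smooth its updates with lr so recent projects weigh more. All numbers below are made up:

```python
def update_map(b, cc_estimate, wc_actual, lr=0.5):
    """Nudge the mapping factor b toward the latest observed
    WC-to-CC ratio; lr smooths the update, emphasising recent
    projects (illustrative, not Dycom's published update)."""
    return (1 - lr) * b + lr * (wc_actual / cc_estimate)

b = 1.0   # start assuming CC and WC efforts match
for cc_est, wc_actual in [(500.0, 1000.0), (600.0, 1150.0), (400.0, 820.0)]:
    b = update_map(b, cc_est, wc_actual)
```

After three WC projects, b has drifted toward roughly 2, matching the later slide's insight that our company initially needs about 2x the effort of company red.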
Reducing the Number of Required
WC Training Examples
79
Dycom can achieve similar / better performance while using only
10% of WC data.
Sample
Result
• Relationship between
effort of different
companies for the
same projects.
• Initially, our company
needs 2x the effort of
company red.
• Later, it needs only
1.2x effort.
Dycom Insights
80
Online Ensemble Learning in Changing Environments
www.cs.bham.ac.uk/~minkull
Dycom Insights
81
• Our company needs 2x
the effort of company
red.
• How to improve our
company?
Analysing Project Data
Number of projects with each feature value for the 20 CC projects
from the medium productivity CC section and the first 20 WC
projects:
82
Both the company and the medium CC section frequently use employees
with high programming language experience.
Analysing Project Data
83
Number of projects with each feature value for the 20 CC projects
from the medium productivity CC section and the first 20 WC
projects:
Medium CC section uses more employees with high virtual machine experience.
So, this is more likely to be a problem for the company. Sensitivity analysis and
project manager knowledge could help to confirm that.
Ensemble Versatility
Diversity can be used to address different issues when
estimating software data.
84
Models of
the same
environment
Models with
different
goals
Models of
different
environments
Models of
different
environments
Increase stability across
data sets.
Deal with changes and transfer knowledge.
Increase performance on
different measures.
1. Introduction
2. Sharing data
3. Privacy and Sharing
4. Sharing models
5. Summary
6a. The past
6b. The future
85
The past
• Focused on minimizing and obfuscating the data of software
projects before sharing.
• Accomplished for individual data owners as well as data owners
who want to share data collaboratively.
• Results were promising.
86
The future
• Model-based reasoning
• Gaining more insights from models.
• Considering temporal aspects of software data.
• Taking goals into account in decision-support tools.
87
• Privacy
• Next step : focus on end user
privacy
• when using software apps that
need personal info to function.
88
End of our tale
Building Comba
1. Rank methods according to
win, loss and win – loss
2. δr is the max. rank change
3. Sort methods acc. to loss and
observe δr values
89
Top 13 methods were CART & ABE methods
(1NN, 5NN) using different preprocessing
methods.
Performance Measures
90
Mapping Training Examples
91
L. Minku, X. Yao. How to Make Best Use of Cross-Company Data in Software Effort Estimation?
ICSE, p. 446-456, 2014.
Reducing the Number of Required
WC Training Examples
92
Dycom’s MAE (and SA), StdDev, RMSE, Corr and LSD were always
similar or better than RT’s (Wilcoxon tests with Holm-Bonferroni
corrections).

Icse15 Tech-briefing Data Science

  • 10. If it works, try to make it better • “The following is my valiant attempt to capture the difference (between PROMISE and MSR)” • “To misquote George Box, I hope my model is more useful than it is wrong: • For the most part, the MSR community was mostly concerned with the initial collection of data sets from software projects. • Meanwhile, the PROMISE community emphasized the analysis of the data after it was collected.” • “The PROMISE people routinely posted all their data on a public repository • their new papers would re-analyze old data, in an attempt to improve that analysis. • In fact, I used to joke “PROMISE. Australian for repeatability” (apologies to the Fosters Brewing company).“ 9 Dr. Prem Devanbu UC Davis General chair, MSR’14 1b. The PROMISE Project
  • 11. The PROMISE repo openscience.us/repo #storingYourResearchData • URL • openscience.us/repo • Data from 100s of projects • E.g. EUSE: • 250,000+ spreadsheets • Oldest continuous repository of SE data • For other repos, see Table 1 of goo.gl/UFZgnd 10 Serve all our data, on-line 1b. The PROMISE Project
  • 12. • Initial, naïve, view: • Collect enough data … • … and the truth will emerge • Reality: • The more data we collected … • … the more variance we observed • It’s like the microscope zoomed in • to smash the slide • So now we routinely slice the data • Find local lessons in local regions. 11 1b. The PROMISE Project Challenges
  • 13. 12 Perspectives on Data Science for Software Engineering Tim Menzies Laurie Williams Thomas Zimmermann 2014 2015 2016 1b. The PROMISE Project Our summary, and other related books The MSR community and others
  • 14. 13 1b. The PROMISE Project This briefing Selected lessons from “Sharing Data and Models”
  • 15. 1. Introduction 2. Sharing data 3. Privacy and Sharing 4. Sharing models 5. Summary Step 1: Throw most of it away Step 2: Share the rest 14
  • 16. Transferring lessons learned: Turkish Toasters to NASA Space Ships 15 Burak Turhan, Tim Menzies, Ayşe B. Bener, and Justin Di Stefano. 2009. On the relative value of cross-company and within-company data for defect prediction. Empirical Softw. Engg. 14, 5 (October 2009),
  • 17. Q: How to transfer data between projects? A: Be very cruel to the data • Ignore most of the data • relevancy filtering: Turhan ESEj’09; Peters TSE’13, Peters ICSE’15 • variance filtering: Kocaguneli TSE’12, TSE’13 • performance similarities: He ESEM’13 • Contort the data • spectral learning (working in PCA space or some other rotation) Menzies, TSE’13; Nam, ICSE’13 • Build a bickering committee • Ensembles Minku, PROMISE’12 16
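The relevancy-filtering idea above can be sketched in the spirit of Turhan et al.'s nearest-neighbor filter: keep only those cross-company rows that lie near the local project's data. The distance function, k, and the toy numbers are our assumptions:

```python
# Nearest-neighbor relevancy filter (illustrative sketch).

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nn_filter(cross_rows, local_rows, k=2):
    """Keep cross-company rows that are among the k nearest
    neighbors of at least one local (within-company) row."""
    keep = set()
    for lr in local_rows:
        nearest = sorted(range(len(cross_rows)),
                         key=lambda i: euclidean(cross_rows[i], lr))
        keep.update(nearest[:k])
    return [cross_rows[i] for i in sorted(keep)]

cross = [[1, 1], [1, 2], [9, 9], [10, 10], [2, 1]]
local = [[1, 1.5]]
print(nn_filter(cross, local, k=2))  # → [[1, 1], [1, 2]]
```

The filtered subset then trains the predictor, which is how cross-company data can become competitive with within-company data.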
  • 18. Q: How to share data? A: Carve most of it away Column pruning • irrelevancy removal • better predictions Row pruning • outliers, • privacy, • anomaly detection, incremental learning, • handling missing values, • cross-company learning • noise reduction Range pruning • explanation • optimization 17
  • 19. Data mining = data carving Michelangelo • Every block of stone has a statue inside it and it is the task of the sculptor to discover it. Someone else • Some databases have models inside and it is the task of the data scientist to go look. 18
  • 20. Data mining = Data Carving • How to mine: 1. Find the cr*p 2. Cut the cr*p 3. Go to step 1 19 • E.g. Discretization • Numerics divided • where class frequencies most change • If no division, • then no information in that attribute • E.g. Classes = (notDiabetic, isDiabetic) • Baseline distribution = (5: 3) Mass: Most change From raw
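The slide’s discretization idea — cut a numeric column where the class frequencies change the most, and skip the cut if no split is informative — can be sketched as below. This is an illustrative single-split version (the function name and the two-class encoding are ours, not the exact method used on the diabetes data):

```python
def best_cut(values, labels):
    """Find the numeric cut point where the class frequencies on
    either side of the cut differ the most (single-split sketch).
    labels are 0/1; returns (cut point, frequency difference)."""
    pairs = sorted(zip(values, labels))
    best_cut_point, best_diff = None, 0.0
    for i in range(1, len(pairs)):
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        p_left = left.count(1) / len(left)     # positive-class rate, left side
        p_right = right.count(1) / len(right)  # positive-class rate, right side
        if abs(p_left - p_right) > best_diff:
            best_diff = abs(p_left - p_right)
            best_cut_point = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_cut_point, best_diff
```

A `best_diff` near zero says the attribute carries no class information, which is the slide’s pruning test.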
  • 21. BTW, works for rows as well as columns • Models are reported from repeated signals, • So R rows of data must contain repeats • Otherwise, no model • Replace all repeats with one exemplar • Cluster data • Replace each cluster with its middle point 20 e.g. Before: 322 rows * 24 columns After: 21 clusters * 5 columns For defect prediction, no information loss
  • 22. And What About Range Pruning? • Classes x,y • Fx, Fy • frequency of discretized ranges in x,y • Log Odds Ratio • log(Fx/Fy ) • Is zero if no difference in x,y • E.g. Data from Norman Fenton’s Bayes nets discussing software defects = yes, no • Do most ranges contribute to determination of defects? • Restrict discussion to just most powerful ranges 21
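The range-pruning test above is a one-liner: score each discretized range by log(Fx/Fy) and keep only the ranges with the largest absolute scores. A minimal sketch (the dictionary layout of the frequencies is our assumption):

```python
import math

def power_ranges(freq_x, freq_y):
    """Score each discretized range by log(Fx/Fy) and return ranges
    sorted by absolute power, most discriminating first. A score of
    zero means the range says nothing about class x vs class y."""
    scored = {r: math.log(freq_x[r] / freq_y[r]) for r in freq_x}
    return sorted(scored, key=lambda r: abs(scored[r]), reverse=True)
```

Restricting discussion to the head of this list is the slide’s “just the most powerful ranges”.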
  • 23. Learning from “powerful” ranges Explanation • Generate tiny models • Sort all ranges by their power • WHICH 1. Select any pair (favoring those with most power) 2. Combine the pair, compute its power 3. Sort it back into the ranges 4. Go to 1 • Initially: • the stack contains single ranges • Subsequently: • the stack holds sets of ranges Tim Menzies, Zach Milton, Burak Turhan, Bojan Cukic, Yue Jiang, Ayse Basar Bener: Defect prediction from static code features: current results, limitations, new approaches. Autom. Softw. Eng. 17(4): 375-407 (2010) Decision tree learning on 14 features WHICH 22
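WHICH’s stack-based search might be sketched as follows. This is a simplification: `power` stands for any user-supplied rule-scoring function, and the pair-sampling policy and fixed round count are our assumptions, not the paper’s exact settings:

```python
import random

def which(ranges, power, rounds=20, seed=1):
    """Sketch of WHICH: keep a stack of candidate rules (sets of
    ranges) sorted by power; repeatedly combine two highly ranked
    candidates, score the union, and sort it back into the stack."""
    rng = random.Random(seed)
    stack = sorted(([r] for r in ranges), key=power, reverse=True)
    for _ in range(rounds):
        pick_from = stack[:max(2, len(stack) // 2)]  # favor powerful rules
        a, b = rng.sample(pick_from, 2)
        combined = sorted(set(a) | set(b))
        if combined not in stack:
            stack.append(combined)
            stack.sort(key=power, reverse=True)
    return stack[0]
```

With an additive scoring function, the search quickly settles on the union of the strongest ranges.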
  • 24. Skip re-entry • My optimizers vs state of the art numeric optimizers • My tools: ran 40 times faster • Generated better solutions • Powerful succinct explanation tool 23 Automatically Finding the Control Variables for Complex System Behavior Gregory Gay, Tim Menzies, Misty Davies, and Karen Gundy-Burlet Journal - Automated Software Engineering, 2010 [PDF]
  • 25. We prune and the model still works? So why so few key variables? • Because otherwise, no model • Models = summaries of repeated similar structures in data • No examples of that structure? Then no model • Volume of the n-dimensional sphere: Vn = Vn−2 · 2πr²/n • For r=1, Vn shrinks once n > 2π • So as complexity grows • Space for similar things shrinks • Models are either low dimensional • Or not supportable (no data) 24
  • 26. Applications of pruning Anomaly detection • Pass around the reduced data set • “Alien”: new data is too “far away” from the reduced data • “Too far”: e.g. 10% of the separation of the most distant pair Incremental learning • Pass around the reduced data set • Add if anomalous: • For defect data, cache does not grow beyond 3% of total data • E.g. LACE2, Peters, ICSE15 Missing values • For effort estimation – Reasoning by analogy on all data with missing “lines of code” measures – Hurts estimation • But after row pruning (using a reverse nearest neighbor technique) – Good estimates, even without size – Why? Other features “stand in” for the missing size features 25
  • 27. Other applications of pruning Noise reduction • Hierarchical clustering • Throw away sub-trees with highest variance • Cluster again • TEAK, IEEE TSE 2012, • Exploiting the Essential Assumptions of Analogy-Based Effort Estimation Cross-company learning • Don’t learn from all data • Just from training data in the same cluster • Works even when data comes from multiple companies • EMSE journal, 2009, relative value of cross-company and within-company data Explanation • Just show samples in the cluster nearest the user’s concerns • Or, list all clusters by their average properties and say “you are here, your competitors are there.” 26
  • 28. But Why Prune at All? Why not use all the data? The original vision of PROMISE • With enough data, our knowledge will stabilize • But the more data we collected … • … the more variance we observed • It’s like the microscope zoomed in • to smash the slide Software projects are different • They change from place to place. • They change from time to time. • My lessons may not apply to you • Your lessons may not even apply to you (tomorrow). • Locality, locality, locality 27
  • 29. Example conclusion instability Are all these studies wrong? 28
  • 30. The uncarved block Michelangelo • Every block of stone has a statue inside it and it is the task of the sculptor to discover it. Someone else • Some databases have models inside and it is the task of the data scientist to go look. 29
  • 31. 1. Introduction 2. Sharing data 3. Privacy and Sharing 4. Sharing models 5. Summary Step 1: Throw most of it away Step 2: Share the rest 30
  • 32. Why We Care • Sebastian Elbaum et al. 2014 Sharing industrial datasets with the research community is extremely valuable, but also extremely challenging as it needs to balance the usefulness of the dataset with the industry’s concerns for privacy and competition. 31 S. Elbaum, A. Mclaughlin, and J. Penix, “The google dataset of testing results,” June 2014. [Online]. Available: https://code.google.com/p/google-shared-dataset-of-test-suite-results
  • 33. Consider NASA Contractors • NASA’s software contractors • Subject to competitive bidding every 2 years, • Unwilling to share data that would lead to sensitive attribute disclosure • e.g. actual software development times 32
  • 34. Sensitive Attribute Disclosure • A privacy threat. • Occurs when a target is associated with information about their sensitive attributes • e.g. software code complexity or actual software development times. 33 B. C. M. Fung, R. Chen, and P. S. Yu, “Privacy-Preserving Data Publishing: A Survey on Recent Developments,” Computing, vol. V, no. 4, pp. 1–53, 2010. J. Brickell and V. Shmatikov, “The cost of privacy: destruction of data-mining utility in anonymized data publishing,” in Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, ser. KDD ’08.
  • 35. Software Defect Prediction 34 • For improving inspection efficiency • But wait! I don’t have enough data. • Local data not always available [Zimmermann et al. 2009] • companies too small; • product in first release, no past data; • no time for data collection; • new technology can make all data irrelevant. [Kitchenham et al. 2007] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, “Cross-project defect prediction: a large scale experiment on data vs. domain vs. process.” in ESEC/SIGSOFT FSE’09, 2009, pp. 91–100. Kitchenham, Barbara A., Emilia Mendes, and Guilherme H. Travassos. "Cross versus within-company cost estimation studies: A systematic review." Software Engineering, IEEE Transactions on 33.5 (2007): 316-329
  • 36. Cross Project Defect Prediction 35 • Use of data from other sources to build defect predictors for target data. • Initial results (Zimmermann et al. 2009). 644 Cross Defect Prediction Experiments Strong (3.4%) Weak (96.6%) T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, “Cross-project defect prediction: a large scale experiment on data vs. domain vs. process.” in ESEC/SIGSOFT FSE’09, 2009, pp. 91–100.
  • 37. Cross Project Defect Prediction • Reason for initial results: Data distribution between source data and target data are different. [Nam et al. 2013] • Other results have more promising outcome (Turhan et al. 2009, He et al. 2012,2013, Nam et al. 2013). • Use of data from other sources to build defect predictors for target data. • This raises privacy concerns 36 J. Nam, S. J. Pan, and S. Kim, “Transfer defect learning,” in ICSE’13. IEEE Press Piscataway, NJ, USA, 2013, pp. 802–811. B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, “On the relative value of cross-company and within-company data for defect prediction,” Empirical Software Engineering, vol. 14, pp. 540–578, 2009. He, Zhimin, et al. "An investigation on the feasibility of cross-project defect prediction." Automated Software Engineering 19.2 (2012): 167-199. He, Zhimin, et al. "Learning from open-source projects: An empirical study on defect prediction." Empirical Software Engineering and Measurement, 2013 ACM/IEEE International Symposium on. IEEE, 2013.
  • 38. What We Want • By using a privacy framework such as LACE2, you will be able to share an obfuscated version of your data while having a high level of privacy and maintaining the usefulness of the data. • Intuition for LACE2: Software code reuse. • Don’t share what others have shared. • In a set of programs, 32% were comprised of reused code (not including libraries). [Selby 2005] 37 Features Algorithm Privacy Low sensitive attribute disclosure. ? Utility Strong defect predictors. ? Cost Low memory requirements. ? Fast runtime. ? R. Selby, “Enabling reuse-based software development of large-scale systems,” Software Engineering, IEEE Transactions on, vol. 31, no. 6, pp. 495–510, June 2005.
  • 39. LACE2: Data Minimization 38 CLIFF: "a=r1" is powerful for selection for class=yes, i.e. more common in "yes" than "no". • P(yes|r1) = like(yes|r1)² / (like(yes|r1) + like(no|r1)) • Step 1: For each class find ranks of all values; • Step 2: Multiply ranks of each row; • Step 3: Select the most powerful rows of each class. F. Peters and T. Menzies, “Privacy and utility for defect prediction: Experiments with morph,” in Proceedings of the 2012 International Conference on Software Engineering, ser. ICSE 2012. Piscataway, NJ, USA: IEEE Press, 2012, pp. 189–199. F. Peters, T. Menzies, L. Gong, and H. Zhang, “Balancing privacy and utility in cross-company defect prediction,” Software Engineering, IEEE Transactions on, vol. 39, no. 8, pp. 1054–1068, Aug 2013.
  a    b    c    d    class
  r1   r1   r1   r2   yes
  r1   r2   r3   r2   yes
  r1   r3   r3   r3   yes
  r4   r4   r4   r4   no
  r1   r5   r5   r2   no
  r6   r6   r6   r2   no
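CLIFF’s three steps can be sketched directly from the slide’s formula and example table (the function names are ours):

```python
def rank_value(value, column, labels, cls):
    """CLIFF's power of a value for class cls (step 1):
    like(cls|value)^2 / (like(cls|value) + like(other|value)),
    where like() is the value's frequency within a class."""
    in_cls = [v for v, c in zip(column, labels) if c == cls]
    out_cls = [v for v, c in zip(column, labels) if c != cls]
    like_in = in_cls.count(value) / len(in_cls)
    like_out = out_cls.count(value) / len(out_cls)
    denom = like_in + like_out
    return like_in ** 2 / denom if denom else 0.0

def row_power(row, table, labels, cls):
    """Step 2 of CLIFF: multiply the per-column value powers.
    Step 3 then keeps the highest-powered rows of each class."""
    power = 1.0
    for j, value in enumerate(row):
        column = [r[j] for r in table]
        power *= rank_value(value, column, labels, cls)
    return power
```

On the slide’s table, "a=r1" scores 0.75 for class=yes, matching the worked formula.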
  • 40. LACE2: Obfuscation 39 MORPH: Mutate the survivors no more than half the distance to their nearest unlike neighbor. • x is original instance; • z is nearest unlike neighbor of x; • y resulting MORPHed instance; • r is random. F. Peters and T. Menzies, “Privacy and utility for defect prediction: Experiments with morph,” in Proceedings of the 2012 International Conference on Software Engineering, ser. ICSE 2012. Piscataway, NJ, USA: IEEE Press, 2012, pp. 189–199. F. Peters, T. Menzies, L. Gong, and H. Zhang, “Balancing privacy and utility in cross-company defect prediction,” Software Engineering, IEEE Transactions on, vol. 39, no. 8, pp. 1054–1068, Aug 2013.
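MORPH’s mutation — move each survivor no more than half the distance to its nearest unlike neighbor — might look like this (the exact range of the random factor r is our assumption; the papers cited above give the precise form):

```python
def morph(x, z, rand):
    """Sketch of MORPH: nudge each attribute of instance x toward its
    nearest unlike neighbor z by a random fraction of at most half
    the gap. rand() is expected to return a float in [0, 1)."""
    return [xi + (zi - xi) * (rand() * 0.5) for xi, zi in zip(x, z)]
```

Because the step is capped at half the gap, the morphed instance can never cross its nearest unlike neighbor, which is what preserves the data’s utility for defect prediction.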
  • 41. LACE2: Group Sharing 40 • Intuition for LACE2: Software code reuse. • Don’t share what others have shared. • In a set of programs, 32% were comprised of reused code (not including libraries). [Selby 2005] • LACE2 : Learn from N software projects • from multiple data owners • As you learn, play “pass the parcel” • The cache of reduced data • Each data owner only adds its “leaders” to the passed cache • Morphing as they go • Each data owner determines “leader” according to distance • separation = distance (d) of farthest 2 instances • d = separation/10 Duda, Richard O., Peter E. Hart, and David G. Stork. Pattern classification. John Wiley & Sons, 2012. R. Selby, “Enabling reuse-based software development of large-scale systems,” Software Engineering, IEEE Transactions on, vol. 31, no. 6, pp. 495–510, June 2005.
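The “leader” rule above — add an instance to the passed cache only if it is far from everything already there, with d = separation/10 — can be sketched as:

```python
def leaders(instances, dist, frac=0.1):
    """LACE2-style leader selection: an instance enters the shared
    cache only if it is farther than d from every current leader,
    where d = frac * separation of the two farthest instances
    (frac = 0.1 mirrors the slide's d = separation/10)."""
    separation = max(dist(a, b) for a in instances for b in instances)
    d = separation * frac
    cache = []
    for x in instances:
        if all(dist(x, leader) > d for leader in cache):
            cache.append(x)
    return cache
```

This is why the cache stays small: near-duplicates of what others have already shared are never added.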
  • 42. LACE2: Sensitive Attribute Disclosure • Occurs when a target is associated with information about their sensitive attributes, (e.g. software code complexity). • Measured as Increased Privacy Ratio (IPR) • 100 % = zero sensitive attribute disclosure • 0% = total sensitive attribute disclosure 41 F. Peters and T. Menzies, “Privacy and utility for defect prediction: Experiments with morph,” in Proceedings of the 2012 International Conference on Software Engineering, ser. ICSE 2012. Piscataway, NJ, USA: IEEE Press, 2012, pp. 189–199. F. Peters, T. Menzies, L. Gong, and H. Zhang, “Balancing privacy and utility in cross-company defect prediction,” Software Engineering, IEEE Transactions on, vol. 39, no. 8, pp. 1054–1068, Aug 2013. Queries Original Obfuscated Breach Q1 0 0 yes Q2 0 1 no Q3 1 1 yes no=1/3 IPR=33%
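Following the slide’s worked example (three queries, one of which returns a different answer on the obfuscated data), IPR is just the fraction of non-breaching queries:

```python
def ipr(original_answers, obfuscated_answers):
    """Increased Privacy Ratio: the share of an attacker's queries
    whose answer on the obfuscated data differs from the original
    (i.e. no breach). 100% = zero sensitive attribute disclosure."""
    no_breach = sum(1 for o, p in zip(original_answers, obfuscated_answers)
                    if o != p)
    return 100.0 * no_breach / len(original_answers)
```

For the slide’s Q1–Q3 this gives 1/3, i.e. IPR = 33%.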
  • 44. Results: Privacy IPRs 43 RQ1: Does LACE2 offer more privacy than LACE1? [Chart: IPR (%) for LACE1 vs LACE2 on the proprietary data sets] • Median IPRs over 10 runs. • The higher the better. • 100% = zero sensitive attribute disclosure • 0% = total sensitive attribute disclosure 7 proprietary data sets
  • 46. Result Summary Features Algorithm Privacy Low sensitive attribute disclosure. yes Utility Strong defect predictors. ? Cost Low memory requirements*. ? Fast runtime. ? 45 * Don’t share what others have shared.
  • 47. Performance Measures • TP (True Positive): defect- prone classes that are classified correctly; • FN (False Negative): defect- prone classes that are wrongly classified to be defect-free; • TN (True Negative): defect- free classes that are classified correctly; • FP (False Positive): defect- free classes that are wrongly classified to be defect-prone. 46
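From these four counts come the two measures used in the following results, pd (probability of detection) and pf (probability of false alarm):

```python
def pd_pf(tp, fn, tn, fp):
    """Probability of detection (pd, a.k.a. recall) and probability
    of false alarm (pf) from the four confusion-matrix counts."""
    pd = tp / (tp + fn)   # defect-prone classes correctly flagged
    pf = fp / (fp + tn)   # defect-free classes wrongly flagged
    return pd, pf
```

Higher pd is better; lower pf is better — which is the trade-off the LACE2 results discuss.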
  • 48. Results: Defect Prediction • Median pds relatively higher for LACE2 for 6/10 data sets • Five local pd results are less than 50% • For ant-1.7, camel-1.6, ivy-2.0, jEdit-4.1 and xerces-1.3. 47 RQ2: Does LACE2 offer more useful defect predictors than LACE1 and local? 0 20 40 60 80 100 pd(%) Test Defect Data Sets Pds for local and LACE2 local LACE2
  • 49. Results: Defect Prediction 48 RQ2: Does LACE2 offer more useful defect predictors than LACE1 and local? 0 10 20 30 40 50 60 70 80 90 pd(%) Test Defect Data Sets Pds for LACE1 and LACE2 LACE1 LACE2
  • 50. Results: Defect Prediction • Consequence of high pds for LACE2 • Higher pfs (lower is better) than local and LACE1. 49
  Pfs for local, LACE1 and LACE2
  Data            local  LACE1  LACE2
  jEdit-4.1        5.7    23.4   41.7
  ivy-2.0          6.9    31.9   46.3
  xerces-1.3       8.0    27.1   33.7
  ant-1.7          8.4    34.3   36.8
  camel-1.6       11.2    28.2   37.6
  lucene-2.4      16.2    24.0   31.1
  xalan-2.6       16.2    28.1   27.3
  velocity-1.6.1  19.1    22.7   30.3
  synapse-1.2     21.2    40.2   55.7
  poi-3.0         23.6    16.4   23.8
  • 52. Result Summary Features Algorithm Privacy Low sensitive attribute disclosure. yes Utility Strong defect predictors. yes Cost Low memory requirements*. ? Fast runtime. ? 51 * Don’t share what others have shared.
  • 53. Results: Memory 52 RQ3: Are system costs of LACE2 (memory) worse than LACE1? [Chart: % of data in private cache for LACE1 vs LACE2 on the proprietary data sets]
  • 54. Result Summary Features Algorithm Privacy Low sensitive attribute disclosure. yes Utility Strong defect predictors. yes Cost Low memory requirements*. yes Fast runtime. ? 53 * Don’t share what others have shared.
  • 55. Results: Runtime 54 RQ3: Are system costs of LACE2 (runtime) worse than LACE1? [Chart: runtime cost for the sharing methods — LACE1: 2205 seconds, LACE2: 2059 seconds]
  • 56. Result Summary Features Algorithm Privacy Low sensitive attribute disclosure. yes Utility Strong defect predictors. yes Cost Low memory requirements. yes Fast runtime. yes 55 • LACE2 provides more privacy than LACE1. • Less data used. • No loss of predictive efficacy due to the sharing method of LACE2. • Don’t share what others have shared. • LACE2’s sharing method does not take more resources than LACE1. • By using LACE2, you will be able to share an obfuscated version of your data while having a high level of privacy and maintaining the usefulness of the data.
  • 57. 1. Introduction 2. Sharing data 3. Privacy and Sharing 4. Sharing models 5. Summary 4a. Bagging 4b. Comba 4c. DCL 4d. Multi-objective ensembles 56
  • 58. Ensembles Artificially generated experts, possibly with slightly different views on how to solve a problem. 57
  • 59. Ensembles Sets of learning machines grouped together with the aim of improving predictive performance. 58 ... estimation1 estimation2 estimationN Base learners E.g.: ensemble estimation = Σ wi estimationi B1 B2 BN T. Dietterich. Ensemble Methods in Machine Learning. Proceedings of the First International Workshop in Multiple Classifier Systems. 2000.
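The weighted sum on the slide, ensemble estimation = Σ wi · estimationi, in code (a plain average when no weights are supplied):

```python
def ensemble_estimate(estimates, weights=None):
    """Weighted-sum combination of base-learner estimates;
    defaults to a simple average when no weights are given."""
    if weights is None:
        weights = [1.0 / len(estimates)] * len(estimates)
    return sum(w * e for w, e in zip(weights, estimates))
```

The intuition: correct predictions of some base learners can compensate for the errors of others.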
  • 60. Ensemble Diversity One of the keys: diversity, i.e., different base learners make different mistakes on the same instances. 59
  • 61. Ensemble Versatility Diversity can be used to address different issues when estimating software data. 60 Models of the same environment Models with different goals Models of different environments
  • 62. Ensemble Versatility Diversity can be used to increase stability across data sets. 61 Models of the same environment Models with different goals Models of different environments
  • 63. Bagging Ensembles of Regression Trees 62 L. Breiman. Bagging Predictors. Machine Learning 24(2):123-140, 1996. Training data (completed projects) Ensemble RT1 RT2 RTN... Sample uniformly with replacement [Example regression tree: root splits on Functional Size at 253; the < 253 branch splits again at 151; leaf estimates: Effort = 5376, 1086, 2798] Regression Trees (RTs):  Local methods.  Divide projects according to attribute value.  Most impactful attributes are in higher levels.  Attributes with insignificant impact are not used.  E.g., REPTrees.
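The bagging loop — sample the training projects uniformly with replacement, train a base learner on each sample, average the estimates — can be sketched with standard-library Python. For brevity the base learner here is a 1-nearest-neighbor estimator on functional size rather than the REPTrees the briefing uses; that substitution is ours:

```python
import random

def bootstrap(data, rng):
    """Sample uniformly with replacement, same size as the original."""
    return [rng.choice(data) for _ in data]

def nn_predict(train, size):
    """Tiny stand-in base learner: effort of the project whose
    functional size is nearest the query."""
    return min(train, key=lambda p: abs(p[0] - size))[1]

def bagged_predict(data, size, n_learners=10, seed=1):
    """Bagging: train each base learner on its own bootstrap sample
    of (functional size, effort) pairs and average the estimates."""
    rng = random.Random(seed)
    estimates = [nn_predict(bootstrap(data, rng), size)
                 for _ in range(n_learners)]
    return sum(estimates) / len(estimates)
```

In practice the same pipeline is available off the shelf, e.g. WEKA’s meta – bagging over trees – REPTree (next slide).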
  • 64. WEKA  Weka: classifiers – meta – bagging  classifiers – trees – REPTree 63
  • 65. Increasing Performance Rank Stability Across Data Sets  Study with 13 data sets from PROMISE and ISBSG repositories.  Bag+RTs:  Obtained the highest rank across data sets in terms of Mean Absolute Error (MAE).  Rarely performed considerably worse (>0.1 SA, SA = 1 – MAE / MAErguess) than the best approach: 64 L. Minku, X. Yao. Ensembles and Locality: Insight on Improving Software Effort Estimation. Information and Software Technology 55(8):1512-1528, 2013.
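The two measures quoted here are straightforward to compute; SA = 1 − MAE / MAErguess, where MAErguess is the mean absolute error of random guessing:

```python
def mae(actual, predicted):
    """Mean Absolute Error over paired actual/predicted efforts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def sa(actual, predicted, mae_rguess):
    """Standardised Accuracy: 1 - MAE / MAErguess. Higher is better;
    0 means no better than random guessing."""
    return 1 - mae(actual, predicted) / mae_rguess
```

So the “>0.1 SA” threshold above means more than a 0.1 drop in Standardised Accuracy relative to the best approach.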
  • 66. Comba 65 Kocaguneli, E., Menzies, T. and Keung, J. On the Value of Ensemble Effort Estimation. IEEE Transactions on Software Engineering, 8(6):1403–1416, 2012. Solo-methods: preprocessing + learning algorithm [Diagram: training data (completed projects) feeds solo-methods S1…SN, whose top-ranked members are combined into an ensemble] • Rank solo-methods based on win, loss, win-loss • Select top-ranked models with few rank changes • And sort according to losses
  • 67. Comba Experimenting with: 90 solo-methods, 20 public data sets, 7 error measures 66 Kocaguneli, E., Menzies, T. and Keung, J. On the Value of Ensemble Effort Estimation. IEEE Transactions on Software Engineering, 8(6):1403 – 1416, 2012.
  • 68. Increasing Rank Stability Across Data Sets 67 Combine top 2,4,8,13 solo-methods via mean, median and IRWM Re-rank solo and multi-methods together according to #losses The first ranked multi-method had very low rank-changes.
  • 69. Ensemble Versatility Diversity can be used to increase performance on different measures. 68 Models of the same environment Models with different goals Models of different environments
  • 70. Multi-Objective Ensemble • There are different measures/metrics of performance for evaluating SEE models. • E.g.: MAE, standard deviation, PRED, etc. • Different measures capture different quality features. 69 • There is no agreed single measure. • A model doing well for a certain measure may not do so well for another.
  • 71. Multi-Objective Ensembles  We can view SEE as a multi-objective learning problem.  A multi-objective approach (e.g. Multi-Objective Evolutionary Algorithm (MOEA)) can be used to:  Better understand the relationship among measures.  Create ensembles that do well for a set of measures, in particular for larger data sets (>=60). 70 L. Minku, X. Yao. Software Effort Estimation as a Multi-objective Learning Problem. ACM Transactions on Software Engineering and Methodology, 22(4):35, 2013.
  • 72. Multi-Objective Ensembles 71 Training data (completed projects) Ensemble B1 B2 B3 Multi-objective evolutionary algorithm creates nondominated models with several different trade- offs. The model with the best performance in terms of each particular measure can be picked to form an ensemble with a good trade-off. L. Minku, X. Yao. Software Effort Estimation as a Multi-objective Learning Problem. ACM Transactions on Software Engineering and Methodology, 22(4):35, 2013.
  • 73. Improving Performance on Different Measures  Sample result: Pareto ensemble of MLPs (ISBSG):  Important: Using performance measures that behave differently from each other (low correlation) provides better results than using performance measures that are highly correlated. More diversity. This can even improve results in terms of other measures not used for training. 72 L. Minku, X. Yao. An Analysis of Multi-objective Evolutionary Algorithms for Training Ensemble Models Based on Different Performance Measures in Software Effort Estimation. PROMISE, 10p, 2013.
  • 74. Ensemble Versatility Diversity can be used to deal with changes and transfer knowledge. 73 Models of the same environment Models with different goals Models of different environments
  • 75. Companies’ Changing Environments Companies are not static entities – they can change with time (concept drift). • Companies can start behaving more or less similarly to other companies. 74 Predicting effort for a single company from ISBSG based on its projects and other companies' projects. How to know when a model from another company is helpful? How to improve performance throughout time?
  • 76. Dynamic Cross-Company Learning (DCL) 75 WC Model Within-company (WC) incoming training data (completed projects arriving with time) CC Model 1 CC Model 2 CC Model M... w DCL learns a weight to reflect the suitability of CC models. For each new training project • If model is not a winner, multiply its weight by β (0 < β < 1) L. Minku, X. Yao. Can Cross-company Data Improve Performance in Software Effort Estimation? PROMISE, p. 69-78, 2012. w1 w2 wM
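DCL’s weighting step on the slide — for each new training project, multiply the weight of every non-winning model by β (0 < β < 1) — can be sketched as:

```python
def dcl_update(weights, errors, beta=0.5):
    """DCL's weighting step: keep the winner's (lowest-error model's)
    weight and multiply every other model's weight by beta, 0 < beta < 1.
    weights[i] and errors[i] belong to the i-th WC/CC model."""
    winner = min(range(len(errors)), key=lambda i: errors[i])
    return [w if i == winner else w * beta
            for i, w in enumerate(weights)]
```

Repeated over time, the weights reflect which CC models are currently suitable for the company, which is how DCL adapts to concept drift.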
  • 77. Improving Performance Throughout Time • DCL adapts to changes by using CC models. • DCL manages to use CC models to improve performance over WC models. 76 Predicting effort for a single company from ISBSG based on its projects and other companies' projects. Sample Result
  • 78. Dynamic Cross-Company Mapped Model Learning (Dycom) 77 WC Model Within-company (WC) incoming training data (completed projects arriving with time) CC Model 1 CC Model 2 CC Model M... w1 w2 wM w How to use CC models even when they are not directly helpful? Dycom learns functions to map CC models to the WC context. L. Minku, X. Yao. How to Make Best Use of Cross-Company Data in Software Effort Estimation? ICSE, p. 446-456, 2014. Map 1 Map 2 Map M
  • 79. Learning Mapping Function 78 where lr is a smoothing factor that allows tuning the emphasis on more recent examples. L. Minku, X. Yao. How to Make Best Use of Cross-Company Data in Software Effort Estimation? ICSE, p. 446-456, 2014. train
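One simple way to realize such a mapping is a multiplicative factor b per CC model, smoothed toward each newly observed WC-effort / CC-estimate ratio. The exponential-smoothing form below is our simplification; the paper gives the exact update:

```python
def mapped_estimate(b, cc_estimate):
    """A CC model's estimate mapped into the WC context by factor b."""
    return b * cc_estimate

def update_factor(b, cc_estimate, true_effort, lr=0.5):
    """Smooth the mapping factor toward the latest observed
    WC-effort / CC-estimate ratio; lr tunes the emphasis placed
    on more recent examples."""
    return lr * b + (1 - lr) * (true_effort / cc_estimate)
```

Plotting b over time is what yields the Dycom insights on later slides (e.g. a company moving from needing 2x to only 1.2x the effort of another).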
  • 80. Reducing the Number of Required WC Training Examples 79 Dycom can achieve similar / better performance while using only 10% of WC data. Sample Result
  • 81. • Relationship between effort of different companies for the same projects. • Initially, our company needs 2x the effort of company red. • Later, it needs only 1.2x. Dycom Insights 80
  • 82. Online Ensemble Learning in Changing Environments www.cs.bham.ac.uk/~minkull Dycom Insights 81 • Our company needs 2x the effort of company red. • How to improve our company?
  • 83. Analysing Project Data Number of projects with each feature value for the 20 CC projects from the medium productivity CC section and the first 20 WC projects: 82 Both the company and the medium CC section frequently use employees with high programming language experience.
  • 84. Analysing Project Data 83 Number of projects with each feature value for the 20 CC projects from the medium productivity CC section and the first 20 WC projects: Medium CC section uses more employees with high virtual machine experience. So, this is more likely to be a problem for the company. Sensitivity analysis and project manager knowledge could help to confirm that.
  • 85. Ensemble Versatility Diversity can be used to address different issues when estimating software data. 84 • Models of the same environment: increase stability across data sets. • Models with different goals: increase performance on different measures. • Models of different environments: deal with changes and transfer knowledge.
  • 86. 1. Introduction 2. Sharing data 3. Privacy and Sharing 4. Sharing models 5. Summary 6a. The past 6b. The future 85
  • 87. The past • Focused on minimizing and obfuscating the shared data of software projects. • Accomplished for individual data owners as well as data owners who want to share data collaboratively. • Results were promising. 86
  • 88. The future • Model-based reasoning • Gaining more insights from models. • Considering temporal aspects of software data. • Taking goals into account in decision-support tools. 87 • Privacy • Next step: focus on end-user privacy • when using software apps that need personal info to function.
  • 90. Building Comba 1. Rank methods according to win, loss and win – loss 2. δr is the max. rank change 3. Sort methods according to loss and observe δr values 89 Top 13 methods were CART & ABE methods (1NN, 5NN) using different preprocessing methods.
  • 92. Mapping Training Examples 91 L. Minku, X. Yao. How to Make Best Use of Cross-Company Data in Software Effort Estimation? ICSE, p. 446-456, 2014.
  • 93. Reducing the Number of Required WC Training Examples 92 Dycom’s MAE (and SA), StdDev, RMSE, Corr and LSD were always similar or better than RT’s (Wilcoxon tests with Holm-Bonferroni corrections).

Editor's notes

  1. Intuitive idea: correct predictions of some may compensate for the errors of others.
  2. Using different models of the same environment to achieve more stable performance ranking across data sets. Models representing different objectives to improve on more than one objective/performance measure. Models of different environments to help each other in improving a given environment and reducing the number of WC data required for learning.
  5. This allows not only to improve performance and reduce the amount of WC data required for learning, but also leads to insights on how to improve a company's productivity. For instance, if we plot the functions that map CC models into the WC context, we can visualize the relationship between the productivity of these companies. That, together with an analysis of the data from these companies plus knowledge from the software manager, can lead to insights on how to improve a company's productivity.
  7. This slide is here just as a backup. It won’t be presented.