Data Science for Software
Engineering (short version)
Tim Menzies, West Virginia University
Fayola Peters, West Virginia University
SEI, August, 2013
SEI http://goo.gl/w4Acsi
ICSE’13 http://goo.gl/29YTMu
0
This talk: reflections on data science
and software analytics
1
Two recent special issues of IEEE Software: July’13; Sept’13.
Editors: Menzies & Zimmermann
• Statistics
• Operations research
• Machine Learning
• Data mining
• Predictive Analytics
• Business Intelligence
• Data Science
• Smart data
• Big Data
2
Insert
buzzword
here
Big data: not-so-successful stories
• Community medicine
– Additional manual
collection required for
their queries
• Software engineering
– Much product data
• examples of source code
– Little process data
• costs, quality measures
We go mining with the data we have,
not the data we want. Get used to it. 3
But what isn’t being said in all the
above about data mining + SE?
1. It’s not all about algorithms (people matter)
2. Data mining is a technical and a sociological problem
– No point in talking about how to learn lessons from many
organizations…
– …. Unless those organizations let you access their data
– The problem of privacy
3. When we learn from each other
– There is more to sharing than just “you give me your
model”
• Local learning, ensembles, filtering..
4
OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Know your domain
– Data science is cyclic
• PART 2: Data Issues
– How to prune data, simpler & smarter
– How to keep your data private
• PART 3: Models
– Envy-based learning
– Ensembles
5
What can we share?
• Two software project
managers meet
– What can they learn
from each other?
• They can share
1. Data
2. Models
3. Methods
• techniques for turning
data into models
4. Insight into the domain
• The standard mistake
– Generally assumed that
models can be shared,
without modification.
– Yeah, right…
7
SE research = sparse sample of a
very diverse set of activities
8
Microsoft research,
Redmond, Building 99
Other studios,
many other projects
And they are all different.
Models may not move
(effort estimation)
• 20 * 66% samples of
data from NASA
• Linear regression on
each sample to learn
$\mathit{effort} = a \cdot \mathit{LOC}^{b} \cdot \sum_i \beta_i x_i$
• Back select to remove
useless $x_i$
• Result?
– Wide variance in the $\beta_i$
9
* T. Menzies, A.Butcher, D.Cok, A.Marcus, L.Layman, F.Shull, B.Turhan, T.Zimmermann, "Local vs. Global Lessons for Defect
Prediction and Effort Estimation," IEEE TSE pre-print 2012. http://menzies.us/pdf/12gense.pdf
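The resampling experiment on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: it drops the $\sum_i \beta_i x_i$ term and fits only $\mathit{effort} = a \cdot \mathit{LOC}^{b}$ by least squares in log space, and it uses the six (KLOC, effort) pairs from the NASA rows listed later in this deck. The function names, the seed, and the choice to report only the spread of `b` are assumptions made for the example.

```python
import math
import random
import statistics

# (KLOC, actual effort) pairs copied from the NASA rows shown later in this deck.
ROWS = [(25.9, 117.6), (24.6, 117.6), (7.7, 31.2),
        (8.2, 36.0), (9.7, 25.2), (2.2, 8.4)]

def fit_power_law(rows):
    """Least-squares fit of log(effort) = log(a) + b*log(loc); returns (a, b)."""
    xs = [math.log(loc) for loc, _ in rows]
    ys = [math.log(effort) for _, effort in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return math.exp(my - b * mx), b

def exponent_spread(rows, repeats=20, fraction=0.66, seed=1):
    """Refit on `repeats` random samples; return (min, max) of the exponent b."""
    rng = random.Random(seed)
    k = max(2, round(fraction * len(rows)))
    bs = [fit_power_law(rng.sample(rows, k))[1] for _ in range(repeats)]
    return min(bs), max(bs)

if __name__ == "__main__":
    print(exponent_spread(ROWS))
```

A wide gap between the two returned values is the kind of coefficient instability the slide reports.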
Models may not move
(defect prediction)
* T. Menzies, A.Butcher, D.Cok, A.Marcus, L.Layman, F.Shull, B.Turhan, T.Zimmermann, "Local vs. Global Lessons for Defect Prediction and
Effort Estimation," IEEE TSE pre-print 2012. http://menzies.us/pdf/12gense.pdf
Oh woe is me
• No generality in SE?
• Nothing we can learn
from each other?
• Forever doomed to never
make a conclusion?
– Always, laboriously,
tediously, slowly, learning
specific lessons that hold
only for specific projects?
• No: 3 things we might
want to share
– Models, methods, data
• If no general models, then
– Share methods
• general methods for
quickly turning local data
into local models.
– Share data
• Find and transfer relevant
data from other projects to
us
11
The rest of this tutorial
• Data science
– How to share data
– How to share methods
• Maybe one day, in the future,
– after we’ve shared enough data and methods
– We’ll be able to report general models
• But first,
– Some general notes on data mining
12
OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
–Know your domain
– Data science is cyclic
• PART 2: Data Issues
– How to prune data, simpler & smarter
– How to keep your data private
• PART 3: Models
– Envy-based learning
– Ensembles
14
Case Study : NASA
• NASA’s Software Engineering Lab, 1990s
– Gave free access to all comers to their data
– But you had to come to get it (to Learn the domain)
– Otherwise: mistakes
• E.g. one class of software module had far more errors than
anything else.
– Dumb data mining algorithms might learn that this kind of module is
inherently more error prone
• Smart data scientists might ask “what kind of
programmer works on that module?”
– A: we always give that stuff to our beginners as a learning exercise
* F. Shull, M. Mendonsa, V. Basili, J. Carver, J. Maldonado, S. Fabbri, G. Travassos, and M. Ferreira, "Knowledge-Sharing Issues in Experimental Software Engineering", EMSE 9(1): 111-137, March 2004.
So algorithms are
only part of the story
16
• Drew Conway, The Data Science Venn Diagram, 2009,
• http://www.dataists.com/2010/09/the-data-science-venn-diagram/
• Dumb data miners miss important
domain semantics
• An ounce of domain knowledge is
worth a ton of algorithms.
• Math and statistics only get you
machine learning.
• Science is about discovery and building
knowledge, which requires some
motivating questions about the world
• The culture of academia does not
reward researchers for understanding
domains.
17
Source: Manuel Sevilla, http://goo.gl/cBKIh
Management
misconceptions of Big Data
• All our data analysis problems will be solved
– Once we boot a CPU farm
– Once we bring up Hadoop and Map-reduce
• If your first question is “what tools to buy?”
– Then you are asking the wrong question
18
• Deploy data scientists before deploying tools
Tools can augment, but
not replace, human insight
Source: http://goo.gl/CCMZo
The great myth
• Wouldn’t it be
wonderful if we did not
have to listen to them
– The dream of
olde-worlde machine
learning
• Circa 1980s
– Dispense with live
experts and resurrect
dead ones.
• But any successful
learner needs biases
– Ways to know what’s
important
• What’s dull
• What can be ignored
– No bias? Can’t ignore
anything
• No summarization
• No generalization
• No way to predict the future
20
Lesson:
TALK TO
THE USERS!
The Inductive
Engineering Manifesto
• Users before algorithms:
– Mining algorithms are only useful in industry if
users fund their use in real-world applications.
• Data science
– Understanding user goals to inductively generate
the models that most matter to the user.
21
• T. Menzies, C. Bird, T. Zimmermann, W. Schulte, and E. Kocaganeli.
The inductive software engineering manifesto. (MALETS '11).
OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Know your domain
–Data science is cyclic
• PART 2: Data Issues
– How to prune data, simpler & smarter
– How to keep your data private
• PART 3: Models
– Envy-based learning
– Ensembles
Do it again, and again,
and again, and …
23
In any industrial
application, data science
is repeated multiple
times to either answer an
extra user question,
make some
enhancement and/or
bug fix to the method,
or deploy it to a
different set of users.
Thou shall not click
• For serious data science studies,
– to ensure repeatability,
– the entire analysis should be automated
– using some high level scripting language;
• e.g. R-script, Matlab, Bash, ….
24
The feedback process
25
OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Know your domain
– Data science is cyclic
• PART 2: Data Issues
–How to prune data, simpler &
smarter
– How to keep your data private
• PART 3: Models
– Envy-based learning
– Ensembles
28
How to Prune Data,
Simpler and Smarter
29
Data is the new
oil
And it has a cost too
30
31
Picking random
training instances is
not a good idea
More popular instances
in the active pool
decrease error
One of the stopping
point conditions fires
Data for Industry / Active Learning
X-axis: instances sorted by decreasing popularity
Y-axis: median MRE
32
Data for Industry / Active Learning
At most 31% of all
the cells
On median 10%
Intrinsic dimensionality: There is a consensus in
the high-dimensional data analysis community
that the only reason any methods work in very
high dimensions is that, in fact, the data is not
truly high-dimensional*
* E. Levina and P.J. Bickel. Maximum likelihood estimation of intrinsic dimension. In Advances in
Neural Information Processing Systems, volume 17, Cambridge, MA, USA, 2004. The MIT Press.
OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Know your domain
– Data science is cyclic
• PART 2: Data Issues
– How to prune data, simpler & smarter
–How to keep your data private
• PART 3: Models
– Envy-based learning
– Ensembles
33
Is Data Sharing Worth the Risk to
Individual Privacy?
What would William Weld say?
• Former Governor of Massachusetts.
• Victim of a re-identification privacy breach.
• The breach led to sensitive attribute disclosure of his medical records.
34
Is Data Sharing Worth the Risk to
Individual Privacy?
What about NASA contractors?
Subject to competitive bidding
every 2 years.
Unwilling to share data
that would lead to
sensitive attribute disclosure.
e.g. actual software
development times
35
When To Share – How To Share
So far we cannot guarantee
100% privacy.
What we have is a directive
as to whether data is private
and useful enough to share...
We have a lot of privacy
algorithms geared toward
minimizing risk.
Old School
K-anonymity
L-diversity
T-closeness
But What About Maximizing Benefits (Utility)?
The degree of risk to the
data sharing entity must
not exceed the benefits of
sharing.
36
37
Balancing Privacy and Utility
or...
Minimize risk of privacy disclosure while maximizing utility.
Instance Selection with CLIFF
Small random moves with MORPH
= CLIFF + MORPH
F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering,
24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
38
CLIFF
Don't share all the data.
CLIFF step 1: for each class, find the ranks
of all values
• e.g. "a=r1" is powerful for selecting
class=yes when it is more common in "yes"
than in "no"
CLIFF step 2: multiply the ranks of the values in each
row
CLIFF step 3: select the most powerful
rows of each class
• Note: linear time
• Can reduce N rows to 0.1N
• Scalability: an O(N²) NUN algorithm
then needs only 0.01 of its original time,
since (0.1N)² = 0.01N²
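The three CLIFF steps (rank values per class, multiply ranks per row, keep the most powerful rows) might be sketched as below. This is an illustrative sketch only: the slides do not give the ranking formula, so the score used here (frequent in this class, rare in the others) is an assumption, as are the function name `cliff`, the `keep` parameter, and the requirement that attribute values are already discrete.

```python
from collections import Counter, defaultdict

def cliff(rows, labels, keep=0.1):
    """Return the indices of the most 'powerful' rows of each class."""
    class_sizes = Counter(labels)
    # counts[label][(column, value)] = how often that value occurs in that class
    counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for col, value in enumerate(row):
            counts[label][(col, value)] += 1

    def rank(col, value, label):
        # Step 1 (assumed scoring): high when the value is common in
        # `label` and uncommon in every other class.
        like = counts[label][(col, value)] / class_sizes[label]
        n_rest = len(labels) - class_sizes[label]
        rest_hits = sum(counts[c][(col, value)] for c in class_sizes if c != label)
        rest = rest_hits / n_rest if n_rest else 0.0
        return like ** 2 / (like + rest)

    selected = []
    for label in class_sizes:
        scored = []
        for i, (row, lab) in enumerate(zip(rows, labels)):
            if lab != label:
                continue
            power = 1.0
            for col, value in enumerate(row):
                power *= rank(col, value, label)  # step 2: multiply ranks
            scored.append((power, i))
        scored.sort(reverse=True)
        n_keep = max(1, round(keep * len(scored)))  # step 3: keep the top rows
        selected.extend(i for _, i in scored[:n_keep])
    return sorted(selected)
```

With `keep=0.1` the function returns roughly one row in ten per class, matching the "N rows to 0.1N" reduction quoted on the slide.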
MORPH
Push the CLIFF data from their original position.
y = x ± (x − z) ∗ r
x ∈ D, the original
instance
z ∈ D the NUN of x
y the resulting
MORPHed
instance
F. Peters and T. Menzies, “Privacy and utility for defect prediction: Experiments with morph,” in Software Engineering (ICSE), 2012 34th
International Conference on, june 2012, pp. 189 –199.
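The MORPH move `y = x ± (x − z) ∗ r` can be sketched as follows, reading NUN as "nearest unlike neighbor" (the closest row with a different class). The slide does not state how `r` or the sign are chosen, so the range 0.15 to 0.35, the single random sign per row, Euclidean distance, and the function names are all assumptions of this sketch.

```python
import math
import random

def nearest_unlike_neighbor(i, rows, labels):
    """Index of the closest row whose class differs from that of row i."""
    best, best_dist = None, math.inf
    for j, (row, label) in enumerate(zip(rows, labels)):
        if label == labels[i]:
            continue
        d = math.dist(rows[i], row)
        if d < best_dist:
            best, best_dist = j, d
    return best

def morph(rows, labels, r_min=0.15, r_max=0.35, seed=1):
    """Return a copy of `rows` where each x becomes x +/- (x - z) * r."""
    rng = random.Random(seed)
    out = []
    for i, x in enumerate(rows):
        z = rows[nearest_unlike_neighbor(i, rows, labels)]
        r = rng.uniform(r_min, r_max)      # assumed range for r
        sign = rng.choice((-1.0, 1.0))     # assumed: one sign per row
        out.append([xi + sign * (xi - zi) * r for xi, zi in zip(x, z)])
    return out
```

The sketch requires numeric attributes and at least two classes; each row moves by a fraction `r` of the distance to its nearest unlike neighbor.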
Case Study: Cross-Company Defect Prediction (CCDP)
Sharing Required.
Zimmermann et al.
Local data not always
available
• companies too small
• product in first release, so
no past data.
Kitchenham et al.
• no time for collection
• new technology can make all
data irrelevant
T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, “Cross-project defect prediction: a large scale experiment on data vs. domain vs. process.”
in ESEC/SIGSOFT FSE’09,2009
B. A. Kitchenham, E. Mendes, and G. H. Travassos, “Cross versus within-company cost estimation studies: A systematic review,”
IEEE Transactions on Software Engineering, vol. 33, pp. 316–329, 2007
- Company B has little or no data to build a defect model;
- Company B uses data from Company A to build defect models;
44
Measuring the Risk
IPR = Increased Privacy Ratio
Query   Original   Privatized   Privacy breach
Q1      0          0            yes
Q2      0          1            no
Q3      1          1            yes
Breaches ("yes") = 2/3
IPR = 1 - 2/3 = 0.33
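The worked example above can be reproduced in a few lines. This is a small sketch that treats a query as a breach when the original and privatized data give the same answer, which is how the table reads; the function name is hypothetical.

```python
def increased_privacy_ratio(original_answers, privatized_answers):
    """IPR = 1 - (fraction of queries answered identically on both data sets)."""
    breaches = sum(o == p for o, p in zip(original_answers, privatized_answers))
    return 1 - breaches / len(original_answers)

if __name__ == "__main__":
    # Q1..Q3 from the slide: answers on original vs. privatized data.
    print(increased_privacy_ratio([0, 0, 1], [0, 1, 1]))
```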
Measuring the Utility
The g-measure
Probability of detection (pd)
Probability of False alarm (pf)
                 Actual yes   Actual no
Predicted yes    TP           FP
Predicted no     FN           TN

pd        = TP / (TP + FN)
pf        = FP / (FP + TN)
g-measure = 2 * pd * (1 - pf) / (pd + (1 - pf))
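The formulas above translate directly into code. This is a minimal sketch with a hypothetical function name; it assumes both classes are present in the test data, so neither denominator is zero.

```python
def g_measure(tp, fp, fn, tn):
    """Harmonic mean of pd (recall) and 1 - pf (specificity)."""
    pd = tp / (tp + fn)   # probability of detection
    pf = fp / (fp + tn)   # probability of false alarm
    return 2 * pd * (1 - pf) / (pd + (1 - pf))

if __name__ == "__main__":
    print(g_measure(tp=8, fp=1, fn=2, tn=9))
```

Because it is a harmonic mean, the g-measure is high only when detection is high and false alarms are low at the same time.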
Making Data Private for CCDP
Comparing CLIFF+MORPH to Data Swapping and K-anonymity
Data Swapping (s10, s20, s40)
A standard perturbation
technique used for privacy
To implement...
• For each NSA, a certain percent
of the values are swapped with
any other value in that NSA.
• For our experiments, these
percentages are 10, 20 and 40.
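One plausible reading of the swapping step, sketched for a single attribute column: pick the stated percentage of positions at random and exchange each with a randomly chosen position in the same column. The function name, the seed, and the position-exchange interpretation of "swapped with any other value" are assumptions, not details taken from the slide.

```python
import random

def swap_column(values, percent, seed=1):
    """Return a copy of one attribute column with `percent`% of entries swapped."""
    rng = random.Random(seed)
    out = list(values)
    n_swap = round(len(out) * percent / 100)
    for i in rng.sample(range(len(out)), n_swap):
        j = rng.randrange(len(out))       # partner position in the same column
        out[i], out[j] = out[j], out[i]
    return out

if __name__ == "__main__":
    for percent in (10, 20, 40):          # the s10, s20, s40 settings
        print(percent, swap_column(list(range(10)), percent))
```

Swapping keeps the column's overall distribution intact while breaking the link between a value and its row.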
k-anonymity (k2, k4)
The Datafly Algorithm.
To implement...
• Make a generalization hierarchy.
• Replace values in the
NSA according to the hierarchy.
• Continue until there are k or
fewer distinct instances and
suppress them.
K. Taneja, M. Grechanik, R. Ghani, and T. Xie, “Testing software in age of data privacy: a balancing act,” in Proceedings of the 19th ACM SIGSOFT symposium and the 13th European
conference on Foundations of software engineering, ser. ESEC/FSE ’11. New York, NY, USA: ACM, 2011, pp. 201–211.
L. Sweeney, “Achieving k-anonymity privacy protection using generalization and suppression,” Int. J. Uncertain. Fuzziness Knowl.-Based Syst., vol. 10, no. 5, pp. 571–588, Oct. 2002.
47
OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Know your domain
– Data science is cyclic
• PART 2: Data Issues
– How to prune data, simpler & smarter
– How to keep your data private
• PART 3: Models
–Envy-based learning
– Ensembles
52
• Seek the fence
where the grass
is greener on the
other side.
• Learn from
there
• Test on here
• Cluster to find
“here” and
“there”
53
Envy =
The WisDOM Of
the COWs
54
@attribute recordnumber real
@attribute projectname {de,erb,gal,X,hst,slp,spl,Y}
@attribute cat2 {Avionics, application_ground, avionicsmonitoring, … }
@attribute center {1,2,3,4,5,6}
@attribute year real
@attribute mode {embedded,organic,semidetached}
@attribute rely {vl,l,n,h,vh,xh}
@attribute data {vl,l,n,h,vh,xh}
…
@attribute equivphyskloc real
@attribute act_effort real
@data
1,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,25.9,117.6
2,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,24.6,117.6
3,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,7.7,31.2
4,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,8.2,36
5,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,9.7,25.2
6,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,2.2,8.4
….
DATA = MULTI-DIMENSIONAL VECTORS
CAUTION: data may not divide neatly
on raw dimensions
• The best description for SE projects may be
synthesized dimensions extracted from the raw
dimensions
55
Fastmap
56
Fastmap: Faloutsos [1995]
O(2N) generation of an axis of large variability
• Pick any point W;
• Find X furthest from W;
• Find Y furthest from X.
c = dist(X,Y)
All points have distances a, b to (X, Y)
• $x = (a^2 + c^2 - b^2) / (2c)$
• $y = \sqrt{a^2 - x^2}$
Find median(x), median(y)
Recurse on four quadrants
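The pivot selection and projection described above can be sketched as below. This is an illustration rather than the authors' implementation: it assumes numeric points, Euclidean distance, and at least two distinct points (so that c is not zero), and the function name and seed are hypothetical.

```python
import math
import random

def fastmap_axis(points, seed=1):
    """Map each point to (x, y) relative to two far-apart pivots X and Y."""
    rng = random.Random(seed)
    w = rng.choice(points)                                   # any point W
    x_pivot = max(points, key=lambda p: math.dist(w, p))     # X furthest from W
    y_pivot = max(points, key=lambda p: math.dist(x_pivot, p))  # Y furthest from X
    c = math.dist(x_pivot, y_pivot)
    projected = []
    for p in points:
        a = math.dist(p, x_pivot)
        b = math.dist(p, y_pivot)
        x = (a * a + c * c - b * b) / (2 * c)   # cosine rule
        y = math.sqrt(max(0.0, a * a - x * x))  # clamp tiny negative rounding
        projected.append((x, y))
    return projected

if __name__ == "__main__":
    print(fastmap_axis([(0, 0), (1, 0), (2, 0), (4, 0)]))
```

Finding the two pivots takes two passes over the data, which is where the O(2N) figure comes from. Splitting at median(x) and median(y) then gives the four quadrants to recurse on.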
Hierarchical partitioning
Grow
• Find two orthogonal dimensions
• Find median(x), median(y)
• Recurse on four quadrants
Prune
• Combine quadtree leaves
with similar densities
• Score each cluster by median
score of class variable
Q: why cluster via FASTMAP?
• A1: Circular methods (e.g. k-means)
assume round clusters.
• But density-based clustering allows
clusters to be any shape
• A2: No need to pre-set the number of
clusters
• A3: Because other methods
(e.g. PCA) are much slower
• Fastmap is O(2N)
• Unoptimized Python:
58
59
Learning via “envy”
Hierarchical partitioning
Grow
• Find two orthogonal dimensions
• Find median(x), median(y)
• Recurse on four quadrants
Prune
• Combine quadtree leaves
with similar densities
• Score each cluster by median
score of class variable
Where is grass greenest?
• This cluster envies the neighbor with a
better score and the maximum
abs(score(this) - score(neighbor))
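The envy rule can be sketched as a small selection function. It assumes each cluster is scored by the median of its class variable and that lower is better (effort or defects), as the later results slide suggests; the names `envied_neighbor` and `scores`, and the dictionary layout, are hypothetical.

```python
import statistics

def envied_neighbor(this, neighbors, scores, lower_is_better=True):
    """Return the neighbor with a better median score and the largest gap, or None."""
    mine = statistics.median(scores[this])
    best, best_gap = None, 0.0
    for n in neighbors:
        theirs = statistics.median(scores[n])
        better = theirs < mine if lower_is_better else theirs > mine
        gap = abs(mine - theirs)
        if better and gap > best_gap:
            best, best_gap = n, gap
    return best

if __name__ == "__main__":
    scores = {"a": [10, 12, 14], "b": [5, 6, 7], "c": [1, 2, 3], "d": [20, 30, 40]}
    print(envied_neighbor("a", ["b", "c", "d"], scores))
```

A cluster with no better-scoring neighbor gets `None` back, meaning it has nothing to envy and would learn from its own data.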
Q: How to learn rules from
neighboring clusters
• A: it doesn’t really matter
– Many competent rule learners
• But to evaluate global vs local rules:
– Use the same rule learner for local vs global rule learning
• This study uses WHICH (Menzies [2010])
– Customizable scoring operator
– Faster termination
– Generates very small rules (good for explanation)
62
Data from
http://promisedata.org/data
• Effort reduction =
{ NasaCoc, China } :
COCOMO or function points
• Defect reduction =
{ lucene, xalan, jedit, synapse, etc. } :
CK metrics (OO)
• Clusters have untreated class
distribution.
• Rules select a subset of the
examples:
– generate a treated class
distribution
[Figure: 25th, 50th, 75th and 100th percentiles (scale 0 to 100) of the class distribution under three treatments: untreated, global, local]
Distributions have percentiles:
• global: treated with rules
learned from all data
• local: treated with rules learned
from neighboring cluster
• Lower median efforts/defects (50th percentile)
• Greater stability (75th – 25th percentile)
• Decreased worst case (100th percentile)
By any measure,
Local BETTER THAN GLOBAL
64
Rules learned in each cluster
• What works best “here” does not work “there”
– Misguided to try and tame conclusion instability
– Inherent in the data
Can’t tame conclusion instability.
• Instead, you can exploit it
• Learn local lessons that do better than overly generalized global theories
65
OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Know your domain
– Data science is cyclic
• PART 2: Data Issues
– How to prune data, simpler & smarter
– How to keep your data private
• PART 3: Models
– Envy-based learning
–Ensembles
66
Managing Dataset Shift
• Types of shift: covariate shift, prior probability shift, sampling, imbalanced data, domain shift, source component shift
• Techniques: outlier detection, relevancy filtering, instance weighting, stratification, cost curves, mixture models
B. Turhan, “On the Dataset Shift Problem in Software Engineering Prediction Models”, Empirical Software Engineering Journal, Vol.17/1-2, pp.62-74, 2012.
Solutions to SE Model Problems/
Ensembles of Learning Machines*
• Sets of learning machines grouped together.
• Aim: to improve predictive performance.
• Base learners B1, B2, ..., BN each produce an estimate: estimation1, estimation2, ..., estimationN
• E.g.: ensemble estimation = $\sum_i w_i \cdot \mathit{estimation}_i$
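The weighted combination on this slide is simple to write down. This is a minimal sketch: the function name is hypothetical, and defaulting to equal weights is an assumption, since the slide does not say how the weights are set.

```python
def ensemble_estimate(estimates, weights=None):
    """Combine base-learner estimates as sum(w_i * estimation_i)."""
    if weights is None:
        # Assumed default: every base learner gets the same weight.
        weights = [1 / len(estimates)] * len(estimates)
    if len(weights) != len(estimates):
        raise ValueError("need exactly one weight per estimate")
    return sum(w * e for w, e in zip(weights, estimates))

if __name__ == "__main__":
    print(ensemble_estimate([100, 200, 300]))
    print(ensemble_estimate([100, 200], weights=[0.25, 0.75]))
```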
* T. Dietterich. Ensemble Methods in Machine Learning. Proceedings of the First International Workshop in
Multiple Classifier Systems. 2000.
68
Solutions to SE Model Problems/
Ensembles of Learning Machines
• One of the keys:
Diverse* ensemble: “base learners” make different
errors on the same instances.
* G. Brown, J. Wyatt, R. Harris, X. Yao. Diversity Creation Methods: A Survey and Categorisation. Journal of
Information Fusion 6(1): 5-20, 2005.
69
Solutions to SE Model Problems/
Dynamic Adaptive Ensembles
• Dynamic Cross-company Learning (DCL)*
– DCL uses new completed projects that arrive with time.
– DCL determines when CC data is useful.
– DCL adapts to changes by using CC data.
Predicting effort for a single company from ISBSG based on its projects and other companies' projects.
* L. Minku, X. Yao. Can Cross-company Data Improve Performance in Software Effort Estimation? Proceedings
of the 8th International Conference on Predictive Models in Software Engineering, p. 69-78, 2012.
http://dx.doi.org/10.1145/2365324.2365334.
70

Weitere ähnliche Inhalte

Was ist angesagt?

Chapter1 introduction
Chapter1 introductionChapter1 introduction
Chapter1 introductionDinesh K
 
Cyber securityeducation may2015
Cyber securityeducation may2015Cyber securityeducation may2015
Cyber securityeducation may2015Mark Guzdial
 
20 most popular data scientists
20 most popular data scientists20 most popular data scientists
20 most popular data scientistsPromptCloud
 
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...Matthew Lease
 
A Pragmatic Perspective on Software Visualization
A Pragmatic Perspective on Software VisualizationA Pragmatic Perspective on Software Visualization
A Pragmatic Perspective on Software VisualizationArie van Deursen
 
Ethical and Legal Issues in Computational Social Science - Lecture 7 in Intro...
Ethical and Legal Issues in Computational Social Science - Lecture 7 in Intro...Ethical and Legal Issues in Computational Social Science - Lecture 7 in Intro...
Ethical and Legal Issues in Computational Social Science - Lecture 7 in Intro...Lauri Eloranta
 
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...Matthew Lease
 
From Representation to Mediation: A New Agenda for Conceptual Modeling Resear...
From Representation to Mediation: A New Agenda for Conceptual Modeling Resear...From Representation to Mediation: A New Agenda for Conceptual Modeling Resear...
From Representation to Mediation: A New Agenda for Conceptual Modeling Resear...Jan Recker @ University of Hamburg
 
Machine Learning for Non-technical People
Machine Learning for Non-technical PeopleMachine Learning for Non-technical People
Machine Learning for Non-technical Peopleindico data
 
Machine Learning Pitfalls
Machine Learning Pitfalls Machine Learning Pitfalls
Machine Learning Pitfalls Dan Elton
 
Avoiding Machine Learning Pitfalls 2-10-18
Avoiding Machine Learning Pitfalls 2-10-18Avoiding Machine Learning Pitfalls 2-10-18
Avoiding Machine Learning Pitfalls 2-10-18Dan Elton
 
Causal networks, learning and inference - Introduction
Causal networks, learning and inference - IntroductionCausal networks, learning and inference - Introduction
Causal networks, learning and inference - IntroductionFabio Stella
 
Deep learning for biomedical discovery and data mining I
Deep learning for biomedical discovery and data mining IDeep learning for biomedical discovery and data mining I
Deep learning for biomedical discovery and data mining IDeakin University
 
POWRR Tools: Lessons learned from an IMLS National Leadership Grant
POWRR Tools: Lessons learned from an IMLS National Leadership GrantPOWRR Tools: Lessons learned from an IMLS National Leadership Grant
POWRR Tools: Lessons learned from an IMLS National Leadership GrantLynne Thomas
 
Semantic, Cognitive, and Perceptual Computing – three intertwined strands of ...
Semantic, Cognitive, and Perceptual Computing – three intertwined strands of ...Semantic, Cognitive, and Perceptual Computing – three intertwined strands of ...
Semantic, Cognitive, and Perceptual Computing – three intertwined strands of ...Amit Sheth
 
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...Matthew Lease
 
After the Pandemic: Rethinking Developer Productivity (There’s more to it th...
After the Pandemic:  Rethinking Developer Productivity (There’s more to it th...After the Pandemic:  Rethinking Developer Productivity (There’s more to it th...
After the Pandemic: Rethinking Developer Productivity (There’s more to it th...Margaret-Anne Storey
 
2017: The Many Faces of Artificial Intelligence: From AI to Big Data - A Hist...
2017: The Many Faces of Artificial Intelligence: From AI to Big Data - A Hist...2017: The Many Faces of Artificial Intelligence: From AI to Big Data - A Hist...
2017: The Many Faces of Artificial Intelligence: From AI to Big Data - A Hist...Leandro de Castro
 

Was ist angesagt? (20)

Figuring out Computer Science
Figuring out Computer ScienceFiguring out Computer Science
Figuring out Computer Science
 
Optimizing Your PhD
Optimizing Your PhDOptimizing Your PhD
Optimizing Your PhD
 
Chapter1 introduction
Chapter1 introductionChapter1 introduction
Chapter1 introduction
 
Cyber securityeducation may2015
Cyber securityeducation may2015Cyber securityeducation may2015
Cyber securityeducation may2015
 
20 most popular data scientists
20 most popular data scientists20 most popular data scientists
20 most popular data scientists
 
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
 
A Pragmatic Perspective on Software Visualization
A Pragmatic Perspective on Software VisualizationA Pragmatic Perspective on Software Visualization
A Pragmatic Perspective on Software Visualization
 
Ethical and Legal Issues in Computational Social Science - Lecture 7 in Intro...
Ethical and Legal Issues in Computational Social Science - Lecture 7 in Intro...Ethical and Legal Issues in Computational Social Science - Lecture 7 in Intro...
Ethical and Legal Issues in Computational Social Science - Lecture 7 in Intro...
 
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
 
From Representation to Mediation: A New Agenda for Conceptual Modeling Resear...
From Representation to Mediation: A New Agenda for Conceptual Modeling Resear...From Representation to Mediation: A New Agenda for Conceptual Modeling Resear...
From Representation to Mediation: A New Agenda for Conceptual Modeling Resear...
 
Machine Learning for Non-technical People
Machine Learning for Non-technical PeopleMachine Learning for Non-technical People
Machine Learning for Non-technical People
 
Machine Learning Pitfalls
Machine Learning Pitfalls Machine Learning Pitfalls
Machine Learning Pitfalls
 
Avoiding Machine Learning Pitfalls 2-10-18
Avoiding Machine Learning Pitfalls 2-10-18Avoiding Machine Learning Pitfalls 2-10-18
Avoiding Machine Learning Pitfalls 2-10-18
 
Causal networks, learning and inference - Introduction
Causal networks, learning and inference - IntroductionCausal networks, learning and inference - Introduction
Causal networks, learning and inference - Introduction
 
Deep learning for biomedical discovery and data mining I
Deep learning for biomedical discovery and data mining IDeep learning for biomedical discovery and data mining I
Deep learning for biomedical discovery and data mining I
 
POWRR Tools: Lessons learned from an IMLS National Leadership Grant
POWRR Tools: Lessons learned from an IMLS National Leadership GrantPOWRR Tools: Lessons learned from an IMLS National Leadership Grant
POWRR Tools: Lessons learned from an IMLS National Leadership Grant
 
Semantic, Cognitive, and Perceptual Computing – three intertwined strands of ...
Semantic, Cognitive, and Perceptual Computing – three intertwined strands of ...Semantic, Cognitive, and Perceptual Computing – three intertwined strands of ...
Semantic, Cognitive, and Perceptual Computing – three intertwined strands of ...
 
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...
 
After the Pandemic: Rethinking Developer Productivity (There’s more to it th...
After the Pandemic:  Rethinking Developer Productivity (There’s more to it th...After the Pandemic:  Rethinking Developer Productivity (There’s more to it th...
After the Pandemic: Rethinking Developer Productivity (There’s more to it th...
 
2017: The Many Faces of Artificial Intelligence: From AI to Big Data - A Hist...
2017: The Many Faces of Artificial Intelligence: From AI to Big Data - A Hist...2017: The Many Faces of Artificial Intelligence: From AI to Big Data - A Hist...
2017: The Many Faces of Artificial Intelligence: From AI to Big Data - A Hist...
 

Dm sei-tutorial-v7

  • 1. Data Science for Software Engineering (short version) Tim Menzies, West Virginia University Fayola Peters, West Virginia University SEI, August, 2013 SEI http://goo.gl/w4Acsi ICSE’13 http://goo.gl/29YTMu 0
  • 2. This talk: reflections on data science and software analytics 1 Two recent special issues of IEEE Software: July’13; Sept’13. Editors: Menzies & Zimmermann
  • 3. • Statistics • Operations research • Machine Learning • Data mining • Predictive Analytics • Business Intelligence • Data Science • Smart data • Big Data 2 Insert buzzword here
  • 4. Big data: not-so-successful stories • Community medicine – Additional manual collection required for their queries • Software engineering – Much product data • examples of source code – Little process data • costs, quality measures We go mining with the data we have, not the data we want. Get used to it. 3
  • 5. But what isn’t being said in all the above about data mining + SE? 1. It’s not all about algorithms (people matter) 2. Data mining is a technical and a sociological problem – No point in talking about how to learn lessons from many organizations… – … unless those organizations let you access their data – The problem of privacy 3. When we learn from each other – There is more to sharing than just “you give me your model” • Local learning, ensembles, filtering…
  • 6. OUTLINE • PART 0: Introduction • PART 1: Organization Issues – Know your domain – Data science is cyclic • PART 2: Data Issues – How to prune data, simpler & smarter – How to keep your data private • PART 3: Models – Envy-based learning – Ensembles 5
  • 7. OUTLINE • PART 0: Introduction • PART 1: Organization Issues – Know your domain – Data science is cyclic • PART 2: Data Issues – How to prune data, simpler & smarter – How to keep your data private • PART 3: Models – Envy-based learning – Ensembles 6
  • 8. What can we share? • Two software project managers meet – What can they learn from each other? • They can share 1. Data 2. Models 3. Methods • techniques for turning data into models 4. Insight into the domain • The standard mistake – Generally assumed that models can be shared, without modification. – Yeah, right… 7
  • 9. SE research = sparse sample of a very diverse set of activities 8 Microsoft research, Redmond, Building 99 Other studios, many other projects And they are all different.
  • 10. Models may not move (effort estimation) • 20 × 66% samples of data from NASA • Linear regression on each sample to learn $\text{effort} = a \cdot \text{LOC}^{b} \cdot \sum_i \beta_i x_i$ • Back select to remove useless $x_i$ • Result? – Wide $\beta_i$ variance * T. Menzies, A. Butcher, D. Cok, A. Marcus, L. Layman, F. Shull, B. Turhan, T. Zimmermann, "Local vs. Global Lessons for Defect Prediction and Effort Estimation," IEEE TSE pre-print 2012. http://menzies.us/pdf/12gense.pdf
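The resampling experiment on this slide can be sketched roughly as follows. This is a simplified illustration, not the study's code: it fits only the `a` and `b` terms of the model (the `β_i x_i` terms and the back-select step are left out), the project data is synthetic and invented here, and the names `fit_loglinear` and `subsample_fits` are hypothetical.

```python
import math
import random
import statistics


def fit_loglinear(rows):
    """Least-squares fit of log(effort) = log(a) + b*log(LOC); returns (a, b)."""
    xs = [math.log(loc) for loc, _ in rows]
    ys = [math.log(effort) for _, effort in rows]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def subsample_fits(rows, repeats=20, fraction=0.66, seed=1):
    """Refit the model on `repeats` random samples, each `fraction` of the data."""
    rng = random.Random(seed)
    k = max(2, int(len(rows) * fraction))
    return [fit_loglinear(rng.sample(rows, k)) for _ in range(repeats)]


# Synthetic (LOC, effort) pairs, made up for illustration only.
data_rng = random.Random(0)
projects = []
for _ in range(60):
    loc = data_rng.uniform(5, 500)
    effort = 3.0 * loc ** 1.1 * math.exp(data_rng.gauss(0, 0.5))
    projects.append((loc, effort))

fits = subsample_fits(projects)
bs = [b for _, b in fits]
print(f"b ranges from {min(bs):.2f} to {max(bs):.2f}, "
      f"stdev {statistics.stdev(bs):.3f}")
```

If the learned coefficients swing widely from sample to sample, as the slide reports for the NASA data, a model learned on one sample is a poor guide to another.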
  • 11. Models may not move (defect prediction)
  • 12. Oh woe is me • No generality in SE? • Nothing we can learn from each other? • Forever doomed to never make a conclusion? – Always, laboriously, tediously, slowly, learning specific lessons that hold only for specific projects? • No: 3 things we might want to share – Models, methods, data • If no general models, then – Share methods • general methods for quickly turning local data into local models. – Share data • Find and transfer relevant data from other projects to us 11
  • 13. The rest of this tutorial • Data science – How to share data – How to share methods • Maybe one day, in the future, – after we’ve shared enough data and methods – We’ll be able to report general models • But first, – Some general notes on data mining 12
  • 14. OUTLINE • PART 0: Introduction • PART 1: Organization Issues – Know your domain – Data science is cyclic • PART 2: Data Issues – How to prune data, simpler & smarter – How to keep your data private • PART 3: Models – Envy-based learning – Ensembles 13
  • 15. OUTLINE • PART 0: Introduction • PART 1: Organization Issues –Know your domain – Data science is cyclic • PART 2: Data Issues – How to prune data, simpler & smarter – How to keep your data private • PART 3: Models – Envy-based learning – Ensembles 14
  • 16. Case Study: NASA • NASA’s Software Engineering Lab, 1990s – Gave free access to all comers to their data – But you had to come to get it (to learn the domain) – Otherwise: mistakes • E.g. one class of software module with far more errors than anything else. – Dumb data mining algorithms might learn that this kind of module is inherently more error prone • Smart data scientists might ask “what kind of programmer works on that module?” – A: we always give that stuff to our beginners as a learning exercise * F. Shull, M. Mendonsa, V. Basili, J. Carver, J. Maldonado, S. Fabbri, G. Travassos, and M. Ferreira, "Knowledge-Sharing Issues in Experimental Software Engineering", EMSE 9(1): 111-137, March 2004.
  • 17. So algorithms are only part of the story • Drew Conway, The Data Science Venn Diagram, 2009, http://www.dataists.com/2010/09/the-data-science-venn-diagram/ • Dumb data miners miss important domain semantics • An ounce of domain knowledge is worth a ton of algorithms. • Math and statistics only get you machine learning • Science is about discovery and building knowledge, which requires some motivating questions about the world • The culture of academia does not reward researchers for understanding domains.
  • 18. Source: Manuel Sevilla, http://goo.gl/cBKIh
  • 19. Management misconceptions of Big Data • All our data analysis problems will be solved – Once we boot a CPU farm – Once we bring up Hadoop and Map-reduce • If your first question is “what tools to buy?” – Then you are asking the wrong question 18
  • 20. Deploy data scientists before deploying tools • Tools can augment, but not replace, human insight • Source: http://goo.gl/CCMZo
  • 21. The great myth • Wouldn’t it be wonderful if we did not have to listen to the users – The dream of olde worlde machine learning • Circa 1980s – Dispense with live experts and resurrect dead ones. • But any successful learner needs biases – Ways to know what’s important • What’s dull • What can be ignored – No bias? Can’t ignore anything • No summarization • No generalization • No way to predict the future Lesson: TALK TO THE USERS!
  • 22. The Inductive Engineering Manifesto • Users before algorithms: – Mining algorithms are only useful in industry if users fund their use in real-world applications. • Data science – Understanding user goals to inductively generate the models that most matter to the user. • T. Menzies, C. Bird, T. Zimmermann, W. Schulte, and E. Kocaguneli. The inductive software engineering manifesto. (MALETS '11).
  • 23. OUTLINE • PART 0: Introduction • PART 1: Organization Issues – Know your domain –Data science is cyclic • PART 2: Data Issues – How to prune data, simpler & smarter – How to keep your data private • PART 3: Models – Envy-based learning – Ensembles 22
  • 24. Do it again, and again, and again, and … In any industrial application, data science is repeated multiple times: to answer an extra user question, to enhance or fix the method, or to deploy it to a different set of users.
  • 25. Thou shall not click • For serious data science studies, – to ensure repeatability, – the entire analysis should be automated – using some high level scripting language; • e.g. R-script, Matlab, Bash, …. 24
  • 28. OUTLINE • PART 0: Introduction • PART 1: Organization Issues – Know your domain – Data science is cyclic • PART 2: Data Issues – How to prune data, simpler & smarter – How to keep your data private • PART 3: Models – Envy-based learning – Ensembles 27
  • 29. OUTLINE • PART 0: Introduction • PART 1: Organization Issues – Know your domain – Data science is cyclic • PART 2: Data Issues –How to prune data, simpler & smarter – How to keep your data private • PART 3: Models – Envy-based learning – Ensembles 28
  • 30. How to Prune Data, Simpler and Smarter 29 Data is the new oil
  • 31. And it has a cost too 30
  • 32. Data for Industry / Active Learning • Picking random training instances is not a good idea • More popular instances in the active pool decrease error • One of the stopping point conditions fires • [Figure: X-axis: instances sorted in decreasing popularity; Y-axis: median MRE]
  • 33. Data for Industry / Active Learning • At most 31% of all the cells • On median 10% • Intrinsic dimensionality: There is a consensus in the high-dimensional data analysis community that the only reason any methods work in very high dimensions is that, in fact, the data is not truly high-dimensional* * E. Levina and P.J. Bickel. Maximum likelihood estimation of intrinsic dimension. In Advances in Neural Information Processing Systems, volume 17, Cambridge, MA, USA, 2004. The MIT Press.
  • 34. OUTLINE • PART 0: Introduction • PART 1: Organization Issues – Know your domain – Data science is cyclic • PART 2: Data Issues – How to prune data, simpler & smarter –How to keep your data private • PART 3: Models – Envy-based learning – Ensembles 33
  • 35. Is Data Sharing Worth the Risk to Individual Privacy? • Former Governor of Massachusetts. • Victim of a re-identification privacy breach. • Led to sensitive attribute disclosure of his medical records. What would William Weld say?
  • 36. Is Data Sharing Worth the Risk to Individual Privacy What about NASA contractors? Subject to competitive bidding every 2 years. Unwilling to share data that would lead to sensitive attribute disclosure. e.g. actual software development times 35
  • 37. When To Share – How To Share So far we cannot guarantee 100% privacy. What we have is a directive as to whether data is private and useful enough to share... We have a lot of privacy algorithms geared toward minimizing risk. Old School K-anonymity L-diversity T-closeness But What About Maximizing Benefits (Utility)? The degree of risk to the data sharing entity must not exceed the benefits of sharing. 36
  • 39. Balancing Privacy and Utility or... Minimize risk of privacy disclosure while maximizing utility. Instance selection with CLIFF + small random moves with MORPH = CLIFF + MORPH. F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE Computer Society Digital Library.
  • 40. CLIFF Don't share all the data. F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society 39
  • 41. CLIFF Don't share all the data. "a=r1" powerful for selection for class=yes more common in "yes" than "no" CLIFF step1: for each class find ranks of all values F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society 40
  • 42. CLIFF Don't share all the data. "a=r1" powerful for selection for class=yes more common in "yes" than "no" CLIFF step2: multiply ranks of each row F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society 41
  • 43. CLIFF: Don't share all the data. CLIFF step 3: select the most powerful rows of each class. Scalability: CLIFF runs in linear time and can reduce N rows to 0.1N, so an O(N²) nearest-unlike-neighbor (NUN) algorithm then costs about 0.01N² steps. F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE Computer Society Digital Library. IEEE Computer Society
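The three CLIFF steps (rank values per class, multiply the ranks along each row, keep the most powerful rows of each class) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the slides do not give the ranking formula, so the `rank` function below is an assumed score that grows when a value is common in the class and rare elsewhere, and the names `cliff` and `keep` are hypothetical. Attribute values are assumed to be discretized already.

```python
from collections import Counter, defaultdict
from math import prod


def cliff(rows, labels, keep=0.1):
    """Return the indices of the rows kept for sharing (sketch only)."""
    n = len(rows)
    in_class = defaultdict(Counter)  # in_class[label][(col, value)] -> count
    overall = Counter()              # overall[(col, value)] -> count
    for row, label in zip(rows, labels):
        for col, value in enumerate(row):
            in_class[label][(col, value)] += 1
            overall[(col, value)] += 1

    # Step 1: rank each value per class. ASSUMED formula: high when the
    # value is frequent inside the class and infrequent outside it.
    def rank(label, col, value):
        like_best = in_class[label][(col, value)] / n
        like_rest = (overall[(col, value)] - in_class[label][(col, value)]) / n
        return like_best ** 2 / (like_best + like_rest)

    # Step 2: the power of a row is the product of the ranks of its values.
    power = [
        prod(rank(label, col, value) for col, value in enumerate(row))
        for row, label in zip(rows, labels)
    ]

    # Step 3: keep the most powerful fraction of rows in each class.
    kept = []
    for label in set(labels):
        members = [i for i, l in enumerate(labels) if l == label]
        members.sort(key=lambda i: -power[i])
        kept.extend(members[: max(1, round(keep * len(members)))])
    return sorted(kept)
```

With `keep=0.1` this mirrors the slide's reduction from N rows to roughly 0.1N; any later pairwise step then only sees the retained rows.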
  • 44. MORPH: Push the CLIFF data from their original position. y = x ± (x − z) ∗ r, where x ∈ D is the original instance, z ∈ D is the NUN of x, and y is the resulting MORPHed instance. F. Peters and T. Menzies, “Privacy and utility for defect prediction: Experiments with MORPH,” in Software Engineering (ICSE), 2012 34th International Conference on, June 2012, pp. 189–199. F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE Computer Society Digital Library. IEEE Computer Society
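A small sketch of the MORPH move y = x ± (x − z) ∗ r may help make the formula concrete. Several details here are assumptions rather than facts from the slides: the range of r (0.15 to 0.35 by default), drawing one sign and one r per row, Euclidean distance over numeric attributes, and the function names themselves.

```python
import random
from math import dist


def nearest_unlike_neighbor(i, rows, labels):
    """Index of the closest row whose label differs from row i.

    Raises ValueError if every row has the same label.
    """
    return min(
        (j for j in range(len(rows)) if labels[j] != labels[i]),
        key=lambda j: dist(rows[i], rows[j]),
    )


def morph(rows, labels, lo=0.15, hi=0.35, rng=None):
    """Return perturbed copies of rows using y = x +/- (x - z) * r."""
    rng = rng or random.Random()
    morphed = []
    for i, x in enumerate(rows):
        z = rows[nearest_unlike_neighbor(i, rows, labels)]
        r = rng.uniform(lo, hi)      # ASSUMED range for the random factor
        sign = rng.choice((-1, 1))   # ASSUMED: one sign per row
        morphed.append(tuple(xi + sign * (xi - zi) * r for xi, zi in zip(x, z)))
    return morphed
```

Because r stays below 1, each row moves only a fraction of the distance to its nearest unlike neighbor, which is the intuition behind "small random moves".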
  • 45. Case Study: Cross-Company Defect Prediction (CCDP) Sharing Required. Zimmermann et al. Local data not always available • companies too small • product in first release, so no past data. Kitchenham et al. • no time for collection • new technology can make all data irrelevant T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, “Cross-project defect prediction: a large scale experiment on data vs. domain vs. process.” in ESEC/SIGSOFT FSE’09,2009 B. A. Kitchenham, E. Mendes, and G. H. Travassos, “Cross versus within-company cost estimation studies: A systematic review,” IEEE Transactions on Software Engineering, vol. 33, pp. 316–329, 2007 - Company B has little or no data to build a defect model; - Company B uses data from Company A to build defect models; 44
  • 46. Measuring the Risk. IPR = Increased Privacy Ratio.
      Query   Original   Privatized   Privacy breach
      Q1      0          0            yes
      Q2      0          1            no
      Q3      1          1            yes
    Breaches ("yes") = 2/3, so IPR = 1 − 2/3 = 0.33. F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE Computer Society Digital Library. IEEE Computer Society
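The IPR arithmetic on this slide is short enough to check in code. In this sketch a query counts as a breach when the privatized data returns the same answer as the original data, which is how the three example queries are scored; the function name is hypothetical.

```python
def increased_privacy_ratio(original_answers, privatized_answers):
    """IPR = 1 - (fraction of queries whose answer is unchanged)."""
    breaches = sum(
        o == p for o, p in zip(original_answers, privatized_answers)
    )
    return 1 - breaches / len(original_answers)


# The slide's example: Q1 and Q3 return the same answer, Q2 does not.
ipr = increased_privacy_ratio([0, 0, 1], [0, 1, 1])
```

A higher IPR means fewer queries leak the original answer.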
  • 47. Measuring the Utility: the g-measure, built from the probability of detection (pd) and the probability of false alarm (pf).
                       Actual yes   Actual no
      Predicted yes    TP           FP
      Predicted no     FN           TN
      pd = TP/(TP+FN)
      pf = FP/(FP+TN)
      g-measure = 2*pd*(1-pf)/(pd+(1-pf))
    F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE Computer Society Digital Library. IEEE Computer Society
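As a worked example of these formulas, the sketch below computes pd, pf and the g-measure from confusion-matrix counts. The function name and the sample counts are illustrative only, and the sketch assumes no denominator is zero.

```python
def g_measure(tp, fp, fn, tn):
    """Harmonic mean of pd and (1 - pf), following the slide's formulas."""
    pd = tp / (tp + fn)   # probability of detection (recall)
    pf = fp / (fp + tn)   # probability of false alarm
    return 2 * pd * (1 - pf) / (pd + (1 - pf))


# Illustrative counts: pd = 40/50 = 0.8 and pf = 20/100 = 0.2,
# so g = 2 * 0.8 * 0.8 / 1.6 = 0.8.
g = g_measure(tp=40, fp=20, fn=10, tn=80)
```

Because it is a harmonic mean, g is high only when detection is high and false alarms are low at the same time.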
  • 48. Making Data Private for CCDP: Comparing CLIFF+MORPH to Data Swapping and K-anonymity. Data Swapping (s10, s20, s40): a standard perturbation technique used for privacy. To implement: • For each NSA, a certain percent of the values are swapped with any other value in that NSA. • For our experiments, these percentages are 10, 20 and 40. k-anonymity (k2, k4): the Datafly algorithm. To implement: • Make a generalization hierarchy. • Replace values in the NSA according to the hierarchy. • Continue until there are k or fewer distinct instances and suppress them. K. Taneja, M. Grechanik, R. Ghani, and T. Xie, “Testing software in age of data privacy: a balancing act,” in Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, ser. ESEC/FSE ’11. New York, NY, USA: ACM, 2011, pp. 201–211. L. Sweeney, “Achieving k-anonymity privacy protection using generalization and suppression,” Int. J. Uncertain. Fuzziness Knowl.-Based Syst., vol. 10, no. 5, pp. 571–588, Oct. 2002.
  • 49. Making Data Private for CCDP Comparing CLIFF+MORPH to Data Swapping and K-anonymity F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society 48
  • 50. Making Data Private for CCDP Comparing CLIFF+MORPH to Data Swapping and K-anonymity F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society 49
  • 51. Making Data Private for CCDP F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
  • 53. OUTLINE • PART 0: Introduction • PART 1: Organization Issues – Know your domain – Data science is cyclic • PART 2: Data Issues – How to prune data, simpler & smarter – How to keep your data private • PART 3: Models –Envy-based learning – Ensembles 52
  • 54. Envy = The WisDOM Of the COWs • Seek the fence where the grass is greener on the other side. • Learn from there • Test on here • Cluster to find “here” and “there”
  • 55. DATA = MULTI-DIMENSIONAL VECTORS
      @attribute recordnumber real
      @attribute projectname {de,erb,gal,X,hst,slp,spl,Y}
      @attribute cat2 {Avionics, application_ground, avionicsmonitoring, … }
      @attribute center {1,2,3,4,5,6}
      @attribute year real
      @attribute mode {embedded,organic,semidetached}
      @attribute rely {vl,l,n,h,vh,xh}
      @attribute data {vl,l,n,h,vh,xh}
      …
      @attribute equivphyskloc real
      @attribute act_effort real
      @data
      1,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,25.9,117.6
      2,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,24.6,117.6
      3,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,7.7,31.2
      4,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,8.2,36
      5,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,9.7,25.2
      6,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,2.2,8.4
      ….
  • 56. CAUTION: data may not divide neatly on raw dimensions • The best description for SE projects may be synthesized dimensions extracted from the raw dimensions
  • 57. Fastmap (Faloutsos [1995]): O(2N) generation of an axis of large variability. • Pick any point W; • Find X furthest from W; • Find Y furthest from X. Let c = dist(X,Y). Every point has distances a and b to X and Y. • x = (a² + c² − b²)/(2c) • y = sqrt(a² − x²). Find median(x), median(y). Recurse on four quadrants.
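The projection step on this slide can be sketched as below. This is an illustrative reading of the slide, not the authors' code: it assumes numeric points compared by Euclidean distance, takes the first point as the arbitrary W, and uses hypothetical names. It covers only the projection onto (x, y); the median split and the recursion on quadrants are left out.

```python
from math import dist, sqrt


def fastmap_project(points):
    """Map each point to (x, y) using two distant poles X and Y."""
    w = points[0]                                       # "pick any point W"
    pole_x = max(points, key=lambda p: dist(p, w))      # X: furthest from W
    pole_y = max(points, key=lambda p: dist(p, pole_x)) # Y: furthest from X
    c = dist(pole_x, pole_y)
    if c == 0:
        raise ValueError("all points are identical; no axis can be drawn")

    projected = []
    for p in points:
        a, b = dist(p, pole_x), dist(p, pole_y)
        x = (a * a + c * c - b * b) / (2 * c)   # cosine rule, measured from X
        y = sqrt(max(0.0, a * a - x * x))       # clamp tiny negative rounding
        projected.append((x, y))
    return projected
```

Finding the two poles takes two passes over the data, which is where the slide's O(2N) figure comes from.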
  • 58. Hierarchical partitioning Prune • Find two orthogonal dimensions • Find median(x), median(y) • Recurse on four quadrants • Combine quadtree leaves with similar densities • Score each cluster by median score of class variable 57 Grow
  • 59. Q: Why cluster via FASTMAP? • A1: Circular methods (e.g. k-means) assume round clusters. • But density-based clustering allows clusters to be any shape • A2: No need to pre-set the number of clusters • A3: Because other methods (e.g. PCA) are much slower • Fastmap is O(2N) • Unoptimized Python (figure omitted)
  • 61. Envy = The WisDOM Of the COWs • Seek the fence where the grass is greener on the other side. • Learn from there • Test on here • Cluster to find “here” and “there”
  • 62. Hierarchical partitioning Prune • Find two orthogonal dimensions • Find median(x), median(y) • Recurse on four quadrants • Combine quadtree leaves with similar densities • Score each cluster by median score of class variable • This cluster envies its neighbor with better score and max abs(score(this) - score(neighbor)) 61 Grow Where is grass greenest?
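The envy rule (pick the neighbor with a better score and the largest absolute score difference) can be sketched in a few lines. The names are hypothetical, and the default that lower scores are better is an assumption that fits effort or defect counts.

```python
def envied_neighbor(this, neighbors, scores, lower_is_better=True):
    """Return the neighboring cluster that `this` envies, or None."""
    def is_better(candidate, current):
        return candidate < current if lower_is_better else candidate > current

    better = [n for n in neighbors if is_better(scores[n], scores[this])]
    if not better:
        return None   # no greener grass nearby; nothing to envy
    return max(better, key=lambda n: abs(scores[this] - scores[n]))


# Median effort per cluster (illustrative numbers only).
scores = {"c1": 100, "c2": 60, "c3": 140, "c4": 80}
target = envied_neighbor("c1", ["c2", "c3", "c4"], scores)
```

Rules would then be learned from the envied cluster ("there") and tested on the envious one ("here").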
  • 63. Q: How to learn rules from neighboring clusters • A: it doesn’t really matter – Many competent rule learners • But to evaluate global vs local rules: – Use the same rule learner for local vs global rule learning • This study uses WHICH (Menzies [2010]) – Customizable scoring operator – Faster termination – Generates very small rules (good for explanation) 62
  • 64. Data from http://promisedata.org/data • Effort reduction = {NasaCoc, China}: COCOMO or function points • Defect reduction = {lucene, xalan, jedit, synapse, etc.}: CK metrics (OO) • Clusters have an untreated class distribution. • Rules select a subset of the examples: – generate a treated class distribution • Distributions have percentiles (25th, 50th, 75th, 100th). [Figure: percentile plot, scale 0 to 100, of three distributions: untreated; global = treated with rules learned from all data; local = treated with rules learned from the neighboring cluster.]
  • 65. • Lower median efforts/defects (50th percentile) • Greater stability (75th – 25th percentile) • Decreased worst case (100th percentile) By any measure, Local BETTER THAN GLOBAL 64
  • 66. Rules learned in each cluster • What works best “here” does not work “there” – Misguided to try and tame conclusion instability – Inherent in the data • Can’t tame conclusion instability. • Instead, you can exploit it • Learn local lessons that do better than overly generalized global theories 65
  • 67. OUTLINE • PART 0: Introduction • PART 1: Organization Issues – Know your domain – Data science is cyclic • PART 2: Data Issues – How to prune data, simpler & smarter – How to keep your data private • PART 3: Models – Envy-based learning –Ensembles 66
  • 68. Managing Dataset Shift. Types of shift: Covariate Shift, Prior Probability Shift, Sampling, Imbalanced Data, Domain Shift, Source Component Shift. Techniques: Outlier ‘Detection’, Relevancy Filtering, Instance Weighting, Stratification, Cost Curves, Mixture Models. B. Turhan, “On the Dataset Shift Problem in Software Engineering Prediction Models”, Empirical Software Engineering Journal, Vol. 17/1-2, pp. 62-74, 2012.
  • 69. Solutions to SE Model Problems / Ensembles of Learning Machines* • Sets of learning machines grouped together. • Aim: to improve predictive performance. [Figure: base learners B1, B2, ..., BN produce estimation1, estimation2, ..., estimationN.] E.g.: ensemble estimation = Σ wi estimationi. * T. Dietterich. Ensemble Methods in Machine Learning. Proceedings of the First International Workshop in Multiple Classifier Systems. 2000.
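The combination rule "ensemble estimation = Σ wi estimationi" is a weighted sum of the base learners' outputs. The sketch below divides by the sum of the weights so that they need not add up to 1; that normalization, the function name, and the sample numbers are assumptions for illustration.

```python
def ensemble_estimate(estimates, weights):
    """Weighted average of base-learner estimates."""
    if len(estimates) != len(weights):
        raise ValueError("need exactly one weight per base learner")
    total = sum(weights)
    return sum(w * e for w, e in zip(weights, estimates)) / total


# Three base learners predicting effort, the first one trusted most.
combined = ensemble_estimate([100.0, 200.0, 300.0], [0.5, 0.25, 0.25])
```

With equal weights this reduces to a plain average of the base learners.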
  • 70. Solutions to SE Model Problems / Ensembles of Learning Machines • One of the keys: a diverse* ensemble, in which “base learners” make different errors on the same instances. * G. Brown, J. Wyatt, R. Harris, X. Yao. Diversity Creation Methods: A Survey and Categorisation. Journal of Information Fusion 6(1): 5-20, 2005.
  • 71. Solutions to SE Model Problems / Dynamic Adaptive Ensembles • Dynamic Cross-company Learning (DCL)*: • DCL uses new completed projects that arrive with time. • DCL determines when CC data is useful. • DCL adapts to changes by using CC data. • Task: predicting effort for a single company from ISBSG based on its projects and other companies' projects. * L. Minku, X. Yao. Can Cross-company Data Improve Performance in Software Effort Estimation? Proceedings of the 8th International Conference on Predictive Models in Software Engineering, p. 69-78, 2012. http://dx.doi.org/10.1145/2365324.2365334.

Editor's notes

  1. Tim, Ekrem