SlideShare ist ein Scribd-Unternehmen logo
1 von 96
Downloaden Sie, um offline zu lesen
Most slides prepared in collaboration with Ahmed Hassan and Dongmei Zhang
More information available at
https://sites.google.com/site/asergrp/dmse/
Software Mining and Software Datasets
Tao Xie
University of Illinois at Urbana-Champaign
http://taoxie.cs.illinois.edu/
taoxie@illinois.edu
• Associate Professor at University of Illinois at Urbana-
Champaign, USA
• Leads the ASE research group at Illinois
• PC Chair of ISSTA 2015, PC Co-Chair of MSR 2011/2012,
ICSM 2009
• Co-organizer of 2007 Dagstuhl Seminar on Mining
Programs and Processes, 2013 NII Shonan Meeting on
Software Analytics: Principles and Practice
2
Software Services
3
Individual
Social
Isolated
Not much content
generation
Collaborative
Huge amount of artifacts
generated anywhere anytime
4
5
Data pervasive
Long product cycle
Experience & gut-feeling
In-lab testing
Informed decision making
Centralized development
Code centric
Debugging in the large
Distributed development
Continuous release
… …
http://msrconf.org
An international effort to
make software repositories actionable
http://openscience.us/repo/
• Transforms static record-
keeping repositories to active
repositories
• Makes repository data
actionable by uncovering
hidden patterns and trends
12
MailinglistBugzilla Crashes
Field logs CVS/SVN
1313
Field
Logs
Source Control
CVS/SVN
Bugzilla Mailing
lists
Crash
Repos
Historical Repositories Runtime Repos
Code Repos
Sourceforge
GoogleCode
A. Hassan and T Xie. Software Intelligence: Future of
Mining Software Engineering Data. In FoSER 2010.
Software analytics is to enable software practitioners to
perform data exploration and analysis in order to obtain
insightful and actionable information for data-driven
tasks around software and services.
16
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as
a Learning Case in Practice: Approaches and Experiences. In MALETS 2011
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Software analytics is to enable software practitioners to
perform data exploration and analysis in order to obtain
insightful and actionable information for data-driven
tasks around software and services.
17
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as
a Learning Case in Practice: Approaches and Experiences. In MALETS 2011
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
18
Research
Topics
Technology Pillars
Target
Audience
Connection to
Practice
Output
19
Software
Users
Software
Development
Process
Software
System
• Covering different areas
of software domain
• Throughout entire
development cycle
• Enabling practitioners to
obtain insights
20
Runtime traces
Program logs
System events
Perf counters
…
Usage log
User surveys
Online forum posts
Blog & Twitter
…
Source code
Bug history
Check-in history
Test cases
…
21
Developer
Tester
Program Manager
Usability engineer
Designer
Support engineer
Management personnel
Operation engineer
• Conveys meaningful and useful understanding
or knowledge towards completing the target
task
• Not easily attainable via directly investigating
raw data without aid of analytics technologies
• Example
– It is easy to count the number of re-opened bugs,
but how to find out the primary reasons for these
re-opened bugs?
22
• Enables software practitioners to come up
with concrete solutions towards completing
the target task
• Examples
– Why bugs were re-opened?
• A list of bug groups each with the same reason of re-
opening
– Which part of my code should be refactored?
• A list of cloned code snippets easily explored from
different perspectives
23
24
Software
Users
Software
Development
Process
Software
System
Information Visualization
Data Analysis Algorithms
Large-scale Computing
Vertical
Horizontal
Bugzilla CVS/SVNMailinglist Crashes
fixed
bug
discussions
Buggy change &
Fixing change
Field
crashes
Estimate fix effort
Mark duplicates
Suggest experts and fix!
New Bug Report
Bugzilla CVS/SVNMailinglist Crashes
fixed
bug
Field
crashes
Suggest APIs
Warn about risky code or bugs
Suggest locations to co-change
New Change
discussions
Buggy change &
Fixing change
Textual Software Artifacts
NL Software Artifacts are of Many Types
• requirements
documents
• code comments
• identifier names
• commit logs
• release notes
• bug reports
• …
• emails discussing bugs,
designs, etc.
• mailing list discussions
• test plans
• project websites & wikis
• …
29
NL Software Artifacts are of Large Quantity
• code comments:
– 2M in Eclipse, 1M in Mozilla, 1M in Linux
• identifier names:
– 1M in Chrome
• commit logs:
– 222K for Linux (05-10), 31K for PostgreSQL
• bug reports:
– 641K in Mozilla, 18K in Linux, 7K in Apache
• …
NL data contains useful information, much of which is not in structured data.
linux/drivers/scsi/in2000.c:
static int in2000_bus_reset(…){ …
reset_hardware(…);
…
}
Code comments contain Specifications
No lock acquisition ⇒ A bug!
linux/drivers/scsi/in2000.c:
/* Caller must hold instance
lock! */
static int reset_hardware(…)
{…}
Tan et al. “/*iComment: Bugs or Bad Comments?*/”, SOSP’07
API documentation contains resource usages
• java.sql.ResultSet.deleteRow() : “Deletes
the current row from this ResultSet object
and from the underlying database”
• java.sql.ResultSet.close() : “Releases this
ResultSet object’s database and JDBC
resources immediately instead of waiting for
this to happen when it is automatically closed”.
java.sql.ResultSet.deleteRow() 
java.sql.ResultSet.close()
Zhong, Zhang, Xie, Mei. Inferring Resource Specifications from Natural Language API Documentation. ASE 2009.
NL Data Contains Useful Info – Example 3
Don’t ignore the semantics of identifiers
Sridhara, Pollock, Vijay-Shanker. Automatically Detecting and Describing High Level Actions within Methods. ICSE 2011
• Unstructured
– Hard to parse, sometimes wrong grammar
• Ambiguous: often has no defined or precise
semantics (as opposed to source code)
– Hard to understand
• Many ways to represent similar concepts
– Hard to extract information from
/* We need to acquire the write IRQ lock before calling ep_unlink(). */
/* Lock must be acquired on entry to this function. */
/* Caller must hold instance lock! */
• Redundant data
• Easy to get “good” results for simple tasks
– Simple algorithms without much tuning effort
• Evolution/version history readily available
• Many techniques to borrow from text
analytics: NLP, Machine Learning (ML),
Information Retrieval (IR), etc.
Stepping Back ….
Image from https://www.snowfactor.com/events/back-to-the-future-quiz/
https://www.kaggle.com/c/the-allen-ai-science-challenge
“consistently understand and correctly
answer general questions about the world.”
“Using a dataset of multiple choice question and answers from a
standardized 8th grade science exam, AI2 is challenging you
to create a model that gets to the head of the class.”
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
• Bug report image
• Overlay the triage questions
Duplicate?
Bugzilla: open source bug tracking tool
http://www.bugzilla.org/
http://www.cs.ubc.ca/labs/spl/projects/bugTriage.html
Assigned To: ?
Anvik, Hiew, Murphy. Who should fix this bug? ICSE 2006.
Wang, Zhang, Xie, Anvik, Sun. An Approach to Detecting Duplicate Bug Reports using Natural Language and Execution Information. ICSE 2008.
Thummalapenta, Sinha, Singhania, Chandra. Automating Test Automation Suresh. ICSE 2012.
Thummalapenta, Sinha, Singhania, Chandra. Automating Test Automation Suresh. ICSE 2012.
Selected units
Selection
threshold
Example of selection based approach from MS Word
conversational structure
…
…Rastkar, Murphy, Murray. Summarizing software artifacts: A case study of bug reports. ICSE’ 10.
Security bug report
“An attacker can exploit a buffer overflow by
sending excessive data into an input field.”
Mislabeled security bug report
“The system crashes when receiving excessive
text in the input field”
Two bug reports describing a buffer overflow
M. Gegick, P. Rotella, T. Xie. Identifying Security Bug Reports via Text Mining: An Industrial Case Study. MSR’10
Term Bug
Report 1
Bug
Report 2
Bug
Report 3
Attack 1 0 1
Buffer
Overflow
1 0 0
Vulnerability 3 0 0
…
Term-by-document frequency matrix quantifies a document
M. Gegick, P. Rotella, T. Xie. Identifying Security Bug Reports via Text Mining: An Industrial Case Study. MSR’10
Start List
Label: Security Label: Non-Security Label:?
A HCP should not change patient’s
account.
An [subject: HCP] should not [action:
change] [resource: patient’s account].
ACP Rule
EffectSubject Action Resource
HCP UPDATE -
change
patient’s
account
deny
Linguistic Analysis
Model-Instance Construction
Transformation
Xiao, Paradkar, Thummalapenta, Xie. Automated Extraction of Security Policies from Natural-Language Software Documents. FSE 2012.
Example Repository:
Project Communication – Mailing lists
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
54
• Most open source projects communicate
through mailing lists or IRC/IM channels
• Rich source of information about the inner
workings of large projects
• Discussions cover topics such as future plans,
design decisions, project policies, code or
patch reviews
• Social network analysis could be performed on
discussion threads
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
55
• Study the content of messages before and after a release
• Use dimensions from a psychometric text analysis tool:
– After Apache 1.3 release there was a drop in optimism
– After Apache 2.0 release there was an increase in sociability
Rigby, Hassan. What can OSS mailing lists tell us? a preliminary psychometric text
analysis of the apache developer mailing list. MSR 2007.
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
56
• When will a developer be invited to join a
project?
– Expertise vs. interest
Bird, Gourley, Devanbu, Swaminathan, Hsu. Open borders? immigration in open source projects. MSR 2007.
Example Repositories:
Source Control and Bug Repositories
A. E. Hassan and T. Xie: Mining
Software Engineering Data
58
Source Control Repositories
• A source control system
tracks changes to
ChangeUnits
• Example of ChangeUnits:
– File (most common)
– Function
– Dependency (e.g., Call)
• Each ChangeUnit:
– Records the developer, change
time, change message, co-
changing Units
ChangeListDeveloper
Time
ChangeChangeUnit
Modify
Add
Remove
Change
Type
* .. *
ChangeList
Message
FI
FR
GM
ChangeList
Type
FI: Feature Introduction
FR: Fault Repairing
GM: General Maint
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
59
Determine
Initial Entity
To Change
Change
Entity
Determine
Other Entities
To Change
Consult
Guru for
Advice
New Req., Bug Fix
“How does a change in one source code
entity propagate to other entities?”
No More
Changes
For Each Entity
Suggested Entity
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
60
• Mine association rules from change history
• Use rules to help propagate changes:
– Recall as high as 44%
– Precision around 30%
• High precision and recall reached in < 1mth
• Prediction accuracy improves prior to a
release (i.e., during maintenance phase)
• Better predictor than static dependencies
alone
Zimmermann, Zeller, Weissgerber, Diehl. Mining Version Histories to Guide Software Changes. TSE 2005.
Hassan, Holt. Predicting change propagation in software systems. ICSM2004.
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
61
import org.eclipse.jdt.internal.compiler.lookup.*;
import org.eclipse.jdt.internal.compiler.*;
import org.eclipse.jdt.internal.compiler.ast.*;
import org.eclipse.jdt.internal.compiler.util.*;
...
import org.eclipse.pde.core.*;
import org.eclipse.jface.wizard.*;
import org.eclipse.ui.*;
14% of all files that import ui packages,
had to be fixed later on.
71% of files that import compiler packages,
had to be fixed later on.
Schröter, Zimmermann, Zeller. Predicting Component Failures at Design Time. ISESE 2006.
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
62
• Given a change can we warn a developer that
there is a bug in it?
– Recall/Precision in 50-60% range
Kim, Zimmermann, Pan, Whitehead. Automatic identification of bug-introducing changes. ASE 2006.
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
63
Percentage of bug-introducing changes for Eclipse
Don’t program on Fridays ;-)
Sliwerski, Zimmermann, Zeller. Don’t Program on Fridays! How to Locate Fix-Inducing Changes. WSR 2005.
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
64
Failure is a 4-letter Word
Zeller, Zimmermann, Bird. Failure is a four-letter word: a parody in empirical research. PROMISE 2011.
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
65
Failure is a 4-letter Word
Zeller, Zimmermann, Bird. Failure is a four-letter word: a parody in empirical research. PROMISE 2011.
• “Cross-validation is inappropriate for estimating
the performance of change classification because
the data points, i.e., changes, follow a certain
order in time.”
• “Randomly partitioning the data set may cause a
model to use future knowledge which should not
be known at the time of prediction to predict
changes in the past.”
– E.g., use information on a change committed in 2014
to predict whether a change committed in 2012 is
buggy or clean.
Tan, Tan, Dara, Mayuex. Online defect prediction for imbalanced data. ICSE 2015 SEIP.
Ye, Bunescu, Liu. Learning to rank relevant files for bug reports using domain knowledge. FSE 2014.
J. Czerwonka, R. Das, N. Nagappan, A. Tarvo, A. Teterev. CRANE: Failure Prediction, Change
Analysis and Test Prioritization in Practice - Experiences from Windows. ICST 2011.
Example Repositories:
Program Source Code
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
69
Source data Mined info
Variable names and function names Software categories
[Kawaguchi et al. 04]
Statement seq in a basic block Copy-paste code
[Li et al. 04]
Set of functions, variables, and data
types within a C function
Programming rules
[Li&Zhou 05]
Sequence of methods within a Java
method
API usages
[Xie&Pei 06]
API method signatures API Jungloids
[Mandelin et al. 05]
• Tons of papers published in the past decade
• Many years of International Workshop on
Software Clones (IWSC) since 2006
• Dagstuhl Seminars
– Software Clone Management towards Industrial
Application (2012)
– Duplication, Redundancy, and Similarity in Software
(2006)
70
Source: http://www.dagstuhl.de/12071
• Motivation
– Copy-and-paste is a common developer behavior
– A real tool widely adopted internally and externally
• XIAO enables code clone analysis in the following
way
– High tunability
– High scalability
– High compatibility
– High explorability
71Dang, Zhang, Ge, Chu, Qiu, Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. ACSAC 2012.
72
• Intuitive similarity metric
• Effective control of the degree of syntactical differences between two
code snippets
• Tunable at fine granularity
• Statement similarity
• % of inserted/deleted/modified statements
• Balance between code structure and disordered statements
for (i = 0; i < n; i ++) {
a ++;
b ++;
c = foo(a, b);
d = bar(a, b, c);
e = a + c; }
for (i = 0; i < n; i ++) {
c = foo(a, b);
a ++;
b ++;
d = bar(a, b, c);
e = a + d;
e ++; }
Dang, Zhang, Ge, Chu, Qiu, Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. ACSAC 2012.
73
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
1 2 3 4 5 6
7
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
1
2
3
4
1
6
5
Dang, Zhang, Ge, Chu, Qiu, Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. ACSAC 2012.
74
Quality gates at milestones
• Architecture refactoring
• Code clone clean up
• Bug fixing
Post-release maintenance
• Security bug investigation
• Bug investigation for sustained
engineering
Development and testing
• Checking for similar issues before
check-in
• Reference info for code review
• Supporting tool for bug triage
Online code
clone search
Offline code
clone analysis
Dang, Zhang, Ge, Chu, Qiu, Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. ACSAC 2012.
75
Available in Visual Studio 2012
Searching similar snippets
for fixing bug once
Finding refactoring
opportunity
Dang, Zhang, Ge, Chu, Qiu, Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. ACSAC 2012.
76
Code Clone Search service integrated into
workflow of Microsoft Security Response Center
Over hundreds of million lines of code indexed
across multiple products
Real security issues proactively identified and
addressed
Dang, Zhang, Ge, Chu, Qiu, Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. ACSAC 2012.
Combined Security Update for Microsoft Office, Windows, .NET Framework,
and Silverlight, published: Tuesday, May 08, 2012
3 publicly disclosed vulnerabilities and seven privately reported involved.
Specifically, one is exploited by the Duqu malware to execute arbitrary code
when a user opened a malicious Office document
Insufficient bounds check within the font parsing subsystem of win32k.sys
Cloned copy in gdiplus.dll, ogl.dll (office), Silver Light, Windows Journal viewer
Microsoft Technet Blog about this bulletin
“However, we wanted to be sure to address the vulnerable code wherever it
appeared across the Microsoft code base. To that end, we have been working
with Microsoft Research to develop a “Cloned Code Detection” system that we
can run for every MSRC case to find any instance of the vulnerable code in any
shipping product. This system is the one that found several of the copies of
CVE-2011-3402 that we are now addressing with MS12-034.”
77Dang, Zhang, Ge, Chu, Qiu, Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. ACSAC 2012.
Software Data Sets
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
79
• PROMISE repository
– http://openscience.us/repo/
• Boa
– http://boa.cs.iastate.edu/
• FLOSSmole:
– http://flossmole.org/
• Software-artifact infrastructure repository:
– http://sir.unl.edu/portal/index.html
• Socorro: Mozilla Crash Stats project
– https://wiki.mozilla.org/Socorro
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
80
• FLOSSmole
– provides raw data about open source projects
– provides summary reports about open source projects
– integrates donated data from other research teams
– provides tools so you can gather your own data
• Data sources
– Sourceforge
– Freshmeat
– Rubyforge
– ObjectWeb
– Free Software Foundation (FSF)
– SourceKibitzer
http://flossmole.org/
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
81
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
82
Yearly MSR Challenge Since 2006
• http://2016.msrconf.org/#/challenge (Boa data for SourceForge, GitHub)
• http://2015.msrconf.org/challenge.php (Stackoverflow data)
• http://2014.msrconf.org/challenge.php (GitHub data)
• http://2013.msrconf.org/challenge.php (Stackoverflow data)
• http://2012.msrconf.org/challenge.php (Change/bug report data for Android)
• http://2011.msrconf.org/msr-challenge.html (Eclipse, Netbeans, Firefox,
Chrome data)
• http://msr.uwaterloo.ca/msr2010/challenge/ (FreeBSD, GNOME
Debian/Ubuntu)
• http://msr.uwaterloo.ca/msr2009/challenge/ (GNOME)
• http://msr.uwaterloo.ca/msr2008/ (Eclipse)
• http://msr.uwaterloo.ca/msr2007/challenge/ (Eclipse, Firefox)
• http://msr.uwaterloo.ca/challenge/ (PostgreSQL, ArgoUML)
• http://msr.uwaterloo.ca/msr2005/
• http://msr.uwaterloo.ca/msr2004/
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
83
• PROMISE repository
– http://openscience.us/repo/
• TraceLab
– http://www.coest.org/index.php/tracelab/about-tracelab
• Apache SVN commits on Github
– https://github.com/monperrus/apache-commits/
• Benchmarks for software maintenance tasks
– http://www.cs.wm.edu/semeru/data/benchmarks/
• Non-functional requirements wordlists
– http://softwareprocess.es/static/What%27s_in_a_Name.html
• Source code ECOsystem Linked Data (SeCold)
– http://www.secold.org/
• Text Analysis for Software Engineering Wiki
– http://textse.wikispaces.com/Home
Must show value before
data quality improves
Correlation vs. Causation
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
91
• Make sure you manually examine the
repositories. Do not fully automate the process!
Image from http://www.quincyma.gov/Government/OCS/neighborhoodtips.cfm
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
92
• Drop all transactions above a large threshold
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
93
• Few developers are given commit privileges
• Actual developer is usually mentioned in the change
message
• One must study project commit policies before
reaching any conclusions
[German 2006]
94
APP DEVELOPERS
APP USERS
App Functional
Requirements
App Security
Requirements
User
Functional
Requirements
User Security
Requirements
informal: app description, etc. permission list, etc.
App Code
Pandita, Xiao, Yang, Enck, Xie. WHYPER: Towards Automating Risk Assessment of Mobile Applications.
USENIX SEC 2013.
Tutorial Slides: http://www.slideshare.net/taoxiease/text-analytics-for-security
• External collaborators: Ahmed Hassan,
Dongmei Zhang, …
• Students…
• Broad colleagues in this area, …
• Funding supports:

Weitere ähnliche Inhalte

Was ist angesagt?

GRASP – паттерны Объектно-Ориентированного Проектирования
GRASP – паттерны Объектно-Ориентированного ПроектированияGRASP – паттерны Объектно-Ориентированного Проектирования
GRASP – паттерны Объектно-Ориентированного Проектирования
Alexander Nemanov
 

Was ist angesagt? (15)

Mypy pycon-fi-2012
Mypy pycon-fi-2012Mypy pycon-fi-2012
Mypy pycon-fi-2012
 
Whitepaper tableau for-the-enterprise-0
Whitepaper tableau for-the-enterprise-0Whitepaper tableau for-the-enterprise-0
Whitepaper tableau for-the-enterprise-0
 
Source Code Analysis with SAST
Source Code Analysis with SASTSource Code Analysis with SAST
Source Code Analysis with SAST
 
Solid Principles
Solid PrinciplesSolid Principles
Solid Principles
 
GRASP – паттерны Объектно-Ориентированного Проектирования
GRASP – паттерны Объектно-Ориентированного ПроектированияGRASP – паттерны Объектно-Ориентированного Проектирования
GRASP – паттерны Объектно-Ориентированного Проектирования
 
Oracle Apex Overview
Oracle Apex OverviewOracle Apex Overview
Oracle Apex Overview
 
Decorator Design Pattern
Decorator Design PatternDecorator Design Pattern
Decorator Design Pattern
 
Data Streaming with Apache Kafka in the Defence and Cybersecurity Industry
Data Streaming with Apache Kafka in the Defence and Cybersecurity IndustryData Streaming with Apache Kafka in the Defence and Cybersecurity Industry
Data Streaming with Apache Kafka in the Defence and Cybersecurity Industry
 
Security misconfiguration
Security misconfigurationSecurity misconfiguration
Security misconfiguration
 
Recon and Bug Bounties - What a great love story!
Recon and Bug Bounties - What a great love story!Recon and Bug Bounties - What a great love story!
Recon and Bug Bounties - What a great love story!
 
The Log4Shell Vulnerability – explained: how to stay secure
The Log4Shell Vulnerability – explained: how to stay secureThe Log4Shell Vulnerability – explained: how to stay secure
The Log4Shell Vulnerability – explained: how to stay secure
 
WAF 101
WAF 101WAF 101
WAF 101
 
Design principles - SOLID
Design principles - SOLIDDesign principles - SOLID
Design principles - SOLID
 
Secure Coding and Threat Modeling
Secure Coding and Threat ModelingSecure Coding and Threat Modeling
Secure Coding and Threat Modeling
 
Web Application Security 101
Web Application Security 101Web Application Security 101
Web Application Security 101
 

Andere mochten auch

Awareness Support in Global Software Development: A Systematic Review Based o...
Awareness Support in Global Software Development: A Systematic Review Based o...Awareness Support in Global Software Development: A Systematic Review Based o...
Awareness Support in Global Software Development: A Systematic Review Based o...
Marco Aurelio Gerosa
 

Andere mochten auch (20)

Transferring Software Testing Tools to Practice
Transferring Software Testing Tools to PracticeTransferring Software Testing Tools to Practice
Transferring Software Testing Tools to Practice
 
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William EnckHotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
 
Advances in Unit Testing: Theory and Practice
Advances in Unit Testing: Theory and PracticeAdvances in Unit Testing: Theory and Practice
Advances in Unit Testing: Theory and Practice
 
User Expectations in Mobile App Security
User Expectations in Mobile App SecurityUser Expectations in Mobile App Security
User Expectations in Mobile App Security
 
Software Analytics - Achievements and Challenges
Software Analytics - Achievements and ChallengesSoftware Analytics - Achievements and Challenges
Software Analytics - Achievements and Challenges
 
Mining Software Repositories
Mining Software RepositoriesMining Software Repositories
Mining Software Repositories
 
Data mining
Data miningData mining
Data mining
 
Knowledge Collaboration by Mining Software Repositories
Knowledge Collaboration by Mining Software RepositoriesKnowledge Collaboration by Mining Software Repositories
Knowledge Collaboration by Mining Software Repositories
 
Transferring Software Testing and Analytics Tools to Practice
Transferring Software Testing and Analytics Tools to PracticeTransferring Software Testing and Analytics Tools to Practice
Transferring Software Testing and Analytics Tools to Practice
 
Towards Mining Software Repositories Research that Matters
Towards Mining Software Repositories Research that MattersTowards Mining Software Repositories Research that Matters
Towards Mining Software Repositories Research that Matters
 
Awareness Support in Global Software Development: A Systematic Review Based o...
Awareness Support in Global Software Development: A Systematic Review Based o...Awareness Support in Global Software Development: A Systematic Review Based o...
Awareness Support in Global Software Development: A Systematic Review Based o...
 
Common Technical Writing Issues
Common Technical Writing IssuesCommon Technical Writing Issues
Common Technical Writing Issues
 
Impact-Driven Research on Software Engineering Tooling
Impact-Driven Research on Software Engineering ToolingImpact-Driven Research on Software Engineering Tooling
Impact-Driven Research on Software Engineering Tooling
 
Mining Development Repositories to Study the Impact of Collaboration on Softw...
Mining Development Repositories to Study the Impact of Collaboration on Softw...Mining Development Repositories to Study the Impact of Collaboration on Softw...
Mining Development Repositories to Study the Impact of Collaboration on Softw...
 
ICSE 2011: Research industry panel
ICSE 2011: Research industry panelICSE 2011: Research industry panel
ICSE 2011: Research industry panel
 
Icpc 2011 storey
Icpc 2011 storeyIcpc 2011 storey
Icpc 2011 storey
 
Mining Software Repositories: Using Humans to Better Software
Mining Software Repositories: Using Humans to Better SoftwareMining Software Repositories: Using Humans to Better Software
Mining Software Repositories: Using Humans to Better Software
 
ICPE2015
ICPE2015ICPE2015
ICPE2015
 
Msr2016 tarek
Msr2016 tarek Msr2016 tarek
Msr2016 tarek
 
WCRE2011
WCRE2011WCRE2011
WCRE2011
 

Ähnlich wie Software Mining and Software Datasets

Bridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to ProductionBridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to Production
Florian Wilhelm
 

Ähnlich wie Software Mining and Software Datasets (20)

Software Analytics: Towards Software Mining that Matters (2014)
Software Analytics:Towards Software Mining that Matters (2014)Software Analytics:Towards Software Mining that Matters (2014)
Software Analytics: Towards Software Mining that Matters (2014)
 
Software Analytics: Data Analytics for Software Engineering and Security
Software Analytics: Data Analytics for Software Engineering and SecuritySoftware Analytics: Data Analytics for Software Engineering and Security
Software Analytics: Data Analytics for Software Engineering and Security
 
Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software Engineering
 
Databases, Web Services and Tools For Systems Immunology
Databases, Web Services and Tools For Systems ImmunologyDatabases, Web Services and Tools For Systems Immunology
Databases, Web Services and Tools For Systems Immunology
 
Big Data: the weakest link
Big Data: the weakest linkBig Data: the weakest link
Big Data: the weakest link
 
Analyzing Big Data's Weakest Link (hint: it might be you)
Analyzing Big Data's Weakest Link  (hint: it might be you)Analyzing Big Data's Weakest Link  (hint: it might be you)
Analyzing Big Data's Weakest Link (hint: it might be you)
 
ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...
ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...
ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...
 
Intelligent Software Engineering: Synergy between AI and Software Engineering
Intelligent Software Engineering: Synergy between AI and Software EngineeringIntelligent Software Engineering: Synergy between AI and Software Engineering
Intelligent Software Engineering: Synergy between AI and Software Engineering
 
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - TrivadisTechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
 
Keynote at-icpc-2020
Keynote at-icpc-2020Keynote at-icpc-2020
Keynote at-icpc-2020
 
Collaborative Data Analysis with Taverna Workflows
Collaborative Data Analysis with Taverna WorkflowsCollaborative Data Analysis with Taverna Workflows
Collaborative Data Analysis with Taverna Workflows
 
Software Engineering Research: Leading a Double-Agent Life.
Software Engineering Research: Leading a Double-Agent Life.Software Engineering Research: Leading a Double-Agent Life.
Software Engineering Research: Leading a Double-Agent Life.
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
 
Performance Analysis of Idle Programs
Performance Analysis of Idle ProgramsPerformance Analysis of Idle Programs
Performance Analysis of Idle Programs
 
Intelligent Software Engineering: Synergy between AI and Software Engineering...
Intelligent Software Engineering: Synergy between AI and Software Engineering...Intelligent Software Engineering: Synergy between AI and Software Engineering...
Intelligent Software Engineering: Synergy between AI and Software Engineering...
 
Msr2021 tutorial-di penta
Msr2021 tutorial-di pentaMsr2021 tutorial-di penta
Msr2021 tutorial-di penta
 
Software Ecosystems = Big Data
Software Ecosystems = Big DataSoftware Ecosystems = Big Data
Software Ecosystems = Big Data
 
20171003 lancaster data conversations Chue-Hong
20171003 lancaster data conversations Chue-Hong20171003 lancaster data conversations Chue-Hong
20171003 lancaster data conversations Chue-Hong
 
01.intro
01.intro01.intro
01.intro
 
Bridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to ProductionBridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to Production
 

Mehr von Tao Xie

Csise15 codehunt
Csise15 codehuntCsise15 codehunt
Csise15 codehunt
Tao Xie
 

Mehr von Tao Xie (16)

MSR 2022 Foundational Contribution Award Talk: Software Analytics: Reflection...
MSR 2022 Foundational Contribution Award Talk: Software Analytics: Reflection...MSR 2022 Foundational Contribution Award Talk: Software Analytics: Reflection...
MSR 2022 Foundational Contribution Award Talk: Software Analytics: Reflection...
 
DSML 2021 Keynote: Intelligent Software Engineering: Working at the Intersect...
DSML 2021 Keynote: Intelligent Software Engineering: Working at the Intersect...DSML 2021 Keynote: Intelligent Software Engineering: Working at the Intersect...
DSML 2021 Keynote: Intelligent Software Engineering: Working at the Intersect...
 
Diversity and Computing/Engineering: Perspectives from Allies
Diversity and Computing/Engineering: Perspectives from AlliesDiversity and Computing/Engineering: Perspectives from Allies
Diversity and Computing/Engineering: Perspectives from Allies
 
MSRA 2018: Intelligent Software Engineering: Synergy between AI and Software ...
MSRA 2018: Intelligent Software Engineering: Synergy between AI and Software ...MSRA 2018: Intelligent Software Engineering: Synergy between AI and Software ...
MSRA 2018: Intelligent Software Engineering: Synergy between AI and Software ...
 
SETTA'18 Keynote: Intelligent Software Engineering: Synergy between AI and So...
SETTA'18 Keynote: Intelligent Software Engineering: Synergy between AI and So...SETTA'18 Keynote: Intelligent Software Engineering: Synergy between AI and So...
SETTA'18 Keynote: Intelligent Software Engineering: Synergy between AI and So...
 
ISEC'18 Tutorial: Research Methodology on Pursuing Impact-Driven Research
ISEC'18 Tutorial: Research Methodology on Pursuing Impact-Driven ResearchISEC'18 Tutorial: Research Methodology on Pursuing Impact-Driven Research
ISEC'18 Tutorial: Research Methodology on Pursuing Impact-Driven Research
 
ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...
ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...
ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Sof...
 
Intelligent Software Engineering: Synergy between AI and Software Engineering
Intelligent Software Engineering: Synergy between AI and Software EngineeringIntelligent Software Engineering: Synergy between AI and Software Engineering
Intelligent Software Engineering: Synergy between AI and Software Engineering
 
Planning and Executing Practice-Impactful Research
Planning and Executing Practice-Impactful ResearchPlanning and Executing Practice-Impactful Research
Planning and Executing Practice-Impactful Research
 
Transferring Software Testing Tools to Practice (AST 2017 Keynote)
Transferring Software Testing Tools to Practice (AST 2017 Keynote)Transferring Software Testing Tools to Practice (AST 2017 Keynote)
Transferring Software Testing Tools to Practice (AST 2017 Keynote)
 
Next Generation Developer Testing: Parameterized Testing
Next Generation Developer Testing: Parameterized TestingNext Generation Developer Testing: Parameterized Testing
Next Generation Developer Testing: Parameterized Testing
 
Csise15 codehunt
Csise15 codehuntCsise15 codehunt
Csise15 codehunt
 
Text Analytics for Security
Text Analytics for SecurityText Analytics for Security
Text Analytics for Security
 
Gamifying Teaching and Learning of Software Engineering and Programming
Gamifying Teaching and Learning of Software Engineering and ProgrammingGamifying Teaching and Learning of Software Engineering and Programming
Gamifying Teaching and Learning of Software Engineering and Programming
 
Tutorial: Text Analytics for Security
Tutorial: Text Analytics for SecurityTutorial: Text Analytics for Security
Tutorial: Text Analytics for Security
 
Teaching and Learning Programming and Software Engineering via Interactive Ga...
Teaching and Learning Programming and Software Engineering via Interactive Ga...Teaching and Learning Programming and Software Engineering via Interactive Ga...
Teaching and Learning Programming and Software Engineering via Interactive Ga...
 

Kürzlich hochgeladen

CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
anilsa9823
 

Kürzlich hochgeladen (20)

W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 

Software Mining and Software Datasets

  • 1. Most slides prepared in collaboration with Ahmed Hassan and Dongmei Zhang More information available at https://sites.google.com/site/asergrp/dmse/ Software Mining and Software Datasets Tao Xie University of Illinois at Urbana-Champaign http://taoxie.cs.illinois.edu/ taoxie@illinois.edu
  • 2. • Associate Professor at University of Illinois at Urbana- Champaign, USA • Leads the ASE research group at Illinois • PC Chair of ISSTA 2015, PC Co-Chair of MSR 2011/2012, ICSM 2009 • Co-organizer of 2007 Dagstuhl Seminar on Mining Programs and Processes, 2013 NII Shonan Meeting on Software Analytics: Principles and Practice 2
  • 4. Individual Social Isolated Not much content generation Collaborative Huge amount of artifacts generated anywhere anytime 4
  • 5. 5 Data pervasive Long product cycle Experience & gut-feeling In-lab testing Informed decision making Centralized development Code centric Debugging in the large Distributed development Continuous release … …
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11. http://msrconf.org An international effort to make software repositories actionable http://openscience.us/repo/
  • 12. • Transforms static record- keeping repositories to active repositories • Makes repository data actionable by uncovering hidden patterns and trends 12 MailinglistBugzilla Crashes Field logs CVS/SVN
  • 13. 1313 Field Logs Source Control CVS/SVN Bugzilla Mailing lists Crash Repos Historical Repositories Runtime Repos Code Repos Sourceforge GoogleCode
  • 14.
  • 15. A. Hassan and T Xie. Software Intelligence: Future of Mining Software Engineering Data. In FoSER 2010.
  • 16. Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services. 16 Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011 http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
  • 17. Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services. 17 Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011 http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
  • 19. 19 Software Users Software Development Process Software System • Covering different areas of software domain • Throughout entire development cycle • Enabling practitioners to obtain insights
  • 20. 20 Runtime traces Program logs System events Perf counters … Usage log User surveys Online forum posts Blog & Twitter … Source code Bug history Check-in history Test cases …
  • 21. 21 Developer Tester Program Manager Usability engineer Designer Support engineer Management personnel Operation engineer
  • 22. • Conveys meaningful and useful understanding or knowledge towards completing the target task • Not easily attainable via directly investigating raw data without aid of analytics technologies • Example – It is easy to count the number of re-opened bugs, but how to find out the primary reasons for these re-opened bugs? 22
  • 23. • Enables software practitioners to come up with concrete solutions towards completing the target task • Examples – Why bugs were re-opened? • A list of bug groups each with the same reason of re- opening – Which part of my code should be refactored? • A list of cloned code snippets easily explored from different perspectives 23
  • 25. Bugzilla CVS/SVNMailinglist Crashes fixed bug discussions Buggy change & Fixing change Field crashes Estimate fix effort Mark duplicates Suggest experts and fix! New Bug Report
  • 26. Bugzilla CVS/SVNMailinglist Crashes fixed bug Field crashes Suggest APIs Warn about risky code or bugs Suggest locations to co-change New Change discussions Buggy change & Fixing change
  • 28. NL Software Artifacts are of Many Types • requirements documents • code comments • identifier names • commit logs • release notes • bug reports • … • emails discussing bugs, designs, etc. • mailing list discussions • test plans • project websites & wikis • …
  • 29. 29 NL Software Artifacts are of Large Quantity • code comments: – 2M in Eclipse, 1M in Mozilla, 1M in Linux • identifier names: – 1M in Chrome • commit logs: – 222K for Linux (05-10), 31K for PostgreSQL • bug reports: – 641K in Mozilla, 18K in Linux, 7K in Apache • … NL data contains useful information, much of which is not in structured data.
  • 30. linux/drivers/scsi/in2000.c: static int in2000_bus_reset(…){ … reset_hardware(…); … } Code comments contain Specifications No lock acquisition ⇒ A bug! linux/drivers/scsi/in2000.c: /* Caller must hold instance lock! */ static int reset_hardware(…) {…} Tan et al. “/*iComment: Bugs or Bad Comments?*/”, SOSP’07
  • 31. API documentation contains resource usages • java.sql.ResultSet.deleteRow() : “Deletes the current row from this ResultSet object and from the underlying database” • java.sql.ResultSet.close() : “Releases this ResultSet object’s database and JDBC resources immediately instead of waiting for this to happen when it is automatically closed”. java.sql.ResultSet.deleteRow()  java.sql.ResultSet.close() Zhong, Zhang, Xie, Mei. Inferring Resource Specifications from Natural Language API Documentation. ASE 2009.
  • 32. NL Data Contains Useful Info – Example 3 Don’t ignore the semantics of identifiers Sridhara, Pollock, Vijay-Shanker. Automatically Detecting and Describing High Level Actions within Methods. ICSE 2011
  • 33. • Unstructured – Hard to parse, sometimes wrong grammar • Ambiguous: often has no defined or precise semantics (as opposed to source code) – Hard to understand • Many ways to represent similar concepts – Hard to extract information from /* We need to acquire the write IRQ lock before calling ep_unlink(). */ /* Lock must be acquired on entry to this function. */ /* Caller must hold instance lock! */
  • 34. • Redundant data • Easy to get “good” results for simple tasks – Simple algorithms without much tuning effort • Evolution/version history readily available • Many techniques to borrow from text analytics: NLP, Machine Learning (ML), Information Retrieval (IR), etc.
  • 35. Stepping Back …. Image from https://www.snowfactor.com/events/back-to-the-future-quiz/
  • 36. https://www.kaggle.com/c/the-allen-ai-science-challenge “consistently understand and correctly answer general questions about the world.” “Using a dataset of multiple choice question and answers from a standardized 8th grade science exam, AI2 is challenging you to create a model that gets to the head of the class.”
  • 37.
  • 38.
  • 39.
  • 40. A. E. Hassan and T. Xie: Mining Software Engineering Data • Bug report image • Overlay the triage questions Duplicate? Bugzilla: open source bug tracking tool http://www.bugzilla.org/ http://www.cs.ubc.ca/labs/spl/projects/bugTriage.html Assigned To: ? Anvik, Hiew, Murphy. Who should fix this bug? ICSE 2006. Wang, Zhang, Xie, Anvik, Sun. An Approach to Detecting Duplicate Bug Reports using Natural Language and Execution Information. ICSE 2008.
  • 41.
  • 42. Thummalapenta, Sinha, Singhania, Chandra. Automating Test Automation Suresh. ICSE 2012.
  • 43. Thummalapenta, Sinha, Singhania, Chandra. Automating Test Automation Suresh. ICSE 2012.
  • 44.
  • 45. Selected units Selection threshold Example of selection based approach from MS Word
  • 47. …Rastkar, Murphy, Murray. Summarizing software artifacts: A case study of bug reports. ICSE’ 10.
  • 48.
  • 49. Security bug report “An attacker can exploit a buffer overflow by sending excessive data into an input field.” Mislabeled security bug report “The system crashes when receiving excessive text in the input field” Two bug reports describing a buffer overflow M. Gegick, P. Rotella, T. Xie. Identifying Security Bug Reports via Text Mining: An Industrial Case Study. MSR’10
  • 50. Term Bug Report 1 Bug Report 2 Bug Report 3 Attack 1 0 1 Buffer Overflow 1 0 0 Vulnerability 3 0 0 … Term-by-document frequency matrix quantifies a document M. Gegick, P. Rotella, T. Xie. Identifying Security Bug Reports via Text Mining: An Industrial Case Study. MSR’10 Start List Label: Security Label: Non-Security Label:?
  • 51.
  • 52. A HCP should not change patient’s account. An [subject: HCP] should not [action: change] [resource: patient’s account]. ACP Rule EffectSubject Action Resource HCP UPDATE - change patient’s account deny Linguistic Analysis Model-Instance Construction Transformation Xiao, Paradkar, Thummalapenta, Xie. Automated Extraction of Security Policies from Natural-Language Software Documents. FSE 2012.
  • 54. A. E. Hassan and T. Xie: Mining Software Engineering Data 54 • Most open source projects communicate through mailing lists or IRC/IM channels • Rich source of information about the inner workings of large projects • Discussions cover topics such as future plans, design decisions, project policies, code or patch reviews • Social network analysis could be performed on discussion threads
  • 55. A. E. Hassan and T. Xie: Mining Software Engineering Data 55 • Study the content of messages before and after a release • Use dimensions from a psychometric text analysis tool: – After Apache 1.3 release there was a drop in optimism – After Apache 2.0 release there was an increase in sociability Rigby, Hassan. What can OSS mailing lists tell us? a preliminary psychometric text analysis of the apache developer mailing list. MSR 2007.
  • 56. A. E. Hassan and T. Xie: Mining Software Engineering Data 56 • When will a developer be invited to join a project? – Expertise vs. interest Bird, Gourley, Devanbu, Swaminathan, Hsu. Open borders? immigration in open source projects. MSR 2007.
  • 57. Example Repositories: Source Control and Bug Repositories
  • 58. A. E. Hassan and T. Xie: Mining Software Engineering Data 58 Source Control Repositories • A source control system tracks changes to ChangeUnits • Example of ChangeUnits: – File (most common) – Function – Dependency (e.g., Call) • Each ChangeUnit: – Records the developer, change time, change message, co- changing Units ChangeListDeveloper Time ChangeChangeUnit Modify Add Remove Change Type * .. * ChangeList Message FI FR GM ChangeList Type FI: Feature Introduction FR: Fault Repairing GM: General Maint
  • 59. A. E. Hassan and T. Xie: Mining Software Engineering Data 59 Determine Initial Entity To Change Change Entity Determine Other Entities To Change Consult Guru for Advice New Req., Bug Fix “How does a change in one source code entity propagate to other entities?” No More Changes For Each Entity Suggested Entity
  • 60. A. E. Hassan and T. Xie: Mining Software Engineering Data 60 • Mine association rules from change history • Use rules to help propagate changes: – Recall as high as 44% – Precision around 30% • High precision and recall reached in < 1mth • Prediction accuracy improves prior to a release (i.e., during maintenance phase) • Better predictor than static dependencies alone Zimmermann, Zeller, Weissgerber, Diehl. Mining Version Histories to Guide Software Changes. TSE 2005. Hassan, Holt. Predicting change propagation in software systems. ICSM2004.
  • 61. A. E. Hassan and T. Xie: Mining Software Engineering Data 61 import org.eclipse.jdt.internal.compiler.lookup.*; import org.eclipse.jdt.internal.compiler.*; import org.eclipse.jdt.internal.compiler.ast.*; import org.eclipse.jdt.internal.compiler.util.*; ... import org.eclipse.pde.core.*; import org.eclipse.jface.wizard.*; import org.eclipse.ui.*; 14% of all files that import ui packages, had to be fixed later on. 71% of files that import compiler packages, had to be fixed later on. Schröter, Zimmermann, Zeller. Predicting Component Failures at Design Time. ISESE 2006.
  • 62. A. E. Hassan and T. Xie: Mining Software Engineering Data 62 • Given a change can we warn a developer that there is a bug in it? – Recall/Precision in 50-60% range Kim, Zimmermann, Pan, Whitehead. Automatic identification of bug-introducing changes. ASE 2006.
  • 63. A. E. Hassan and T. Xie: Mining Software Engineering Data 63 Percentage of bug-introducing changes for Eclipse Don’t program on Fridays ;-) Sliwerski, Zimmermann, Zeller. Don’t Program on Fridays! How to Locate Fix-Inducing Changes. WSR 2005.
  • 64. A. E. Hassan and T. Xie: Mining Software Engineering Data 64 Failure is a 4-letter Word Zeller, Zimmermann, Bird. Failure is a four-letter word: a parody in empirical research. PROMISE 2011.
  • 65. A. E. Hassan and T. Xie: Mining Software Engineering Data 65 Failure is a 4-letter Word Zeller, Zimmermann, Bird. Failure is a four-letter word: a parody in empirical research. PROMISE 2011.
  • 66. • “Cross-validation is inappropriate for estimating the performance of change classification because the data points, i.e., changes, follow a certain order in time.” • “Randomly partitioning the data set may cause a model to use future knowledge which should not be known at the time of prediction to predict changes in the past.” – E.g., use information on a change committed in 2014 to predict whether a change committed in 2012 is buggy or clean. Tan, Tan, Dara, Mayuex. Online defect prediction for imbalanced data. ICSE 2015 SEIP. Ye, Bunescu, Liu. Learning to rank relevant files for bug reports using domain knowledge. FSE 2014.
  • 67. J. Czerwonka, R. Das, N. Nagappan, A. Tarvo, A. Teterev. CRANE: Failure Prediction, Change Analysis and Test Prioritization in Practice - Experiences from Windows. ICST 2011.
  • 69. A. E. Hassan and T. Xie: Mining Software Engineering Data 69 Source data Mined info Variable names and function names Software categories [Kawaguchi et al. 04] Statement seq in a basic block Copy-paste code [Li et al. 04] Set of functions, variables, and data types within a C function Programming rules [Li&Zhou 05] Sequence of methods within a Java method API usages [Xie&Pei 06] API method signatures API Jungloids [Mandelin et al. 05]
  • 70. • Tons of papers published in the past decade • Many years of International Workshop on Software Clones (IWSC) since 2006 • Dagstuhl Seminars – Software Clone Management towards Industrial Application (2012) – Duplication, Redundancy, and Similarity in Software (2006) 70 Source: http://www.dagstuhl.de/12071
  • 71. • Motivation – Copy-and-paste is a common developer behavior – A real tool widely adopted internally and externally • XIAO enables code clone analysis in the following way – High tunability – High scalability – High compatibility – High explorability 71Dang, Zhang, Ge, Chu, Qiu, Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. ACSAC 2012.
  • 72. 72 • Intuitive similarity metric • Effective control of the degree of syntactical differences between two code snippets • Tunable at fine granularity • Statement similarity • % of inserted/deleted/modified statements • Balance between code structure and disordered statements for (i = 0; i < n; i ++) { a ++; b ++; c = foo(a, b); d = bar(a, b, c); e = a + c; } for (i = 0; i < n; i ++) { c = foo(a, b); a ++; b ++; d = bar(a, b, c); e = a + d; e ++; } Dang, Zhang, Ge, Chu, Qiu, Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. ACSAC 2012.
  • 73. 73 1. Clone navigation based on source tree hierarchy 2. Pivoting of folder level statistics 3. Folder level statistics 4. Clone function list in selected folder 5. Clone function filters 6. Sorting by bug or refactoring potential 7. Tagging 1 2 3 4 5 6 7 1. Block correspondence 2. Block types 3. Block navigation 4. Copying 5. Bug filing 6. Tagging 1 2 3 4 1 6 5 Dang, Zhang, Ge, Chu, Qiu, Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. ACSAC 2012.
  • 74. 74 Quality gates at milestones • Architecture refactoring • Code clone clean up • Bug fixing Post-release maintenance • Security bug investigation • Bug investigation for sustained engineering Development and testing • Checking for similar issues before check-in • Reference info for code review • Supporting tool for bug triage Online code clone search Offline code clone analysis Dang, Zhang, Ge, Chu, Qiu, Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. ACSAC 2012.
  • 75. 75 Available in Visual Studio 2012 Searching similar snippets for fixing bug once Finding refactoring opportunity Dang, Zhang, Ge, Chu, Qiu, Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. ACSAC 2012.
  • 76. 76 Code Clone Search service integrated into workflow of Microsoft Security Response Center Over hundreds of million lines of code indexed across multiple products Real security issues proactively identified and addressed Dang, Zhang, Ge, Chu, Qiu, Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. ACSAC 2012.
  • 77. Combined Security Update for Microsoft Office, Windows, .NET Framework, and Silverlight, published: Tuesday, May 08, 2012 3 publicly disclosed vulnerabilities and seven privately reported involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document Insufficient bounds check within the font parsing subsystem of win32k.sys Cloned copy in gdiplus.dll, ogl.dll (office), Silver Light, Windows Journal viewer Microsoft Technet Blog about this bulletin “However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.” 77Dang, Zhang, Ge, Chu, Qiu, Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. ACSAC 2012.
  • 79. A. E. Hassan and T. Xie: Mining Software Engineering Data 79 • PROMISE repository – http://openscience.us/repo/ • Boa – http://boa.cs.iastate.edu/ • FLOSSmole: – http://flossmole.org/ • Software-artifact infrastructure repository: – http://sir.unl.edu/portal/index.html • Socorro: Mozilla Crash Stats project – https://wiki.mozilla.org/Socorro
  • 80. A. E. Hassan and T. Xie: Mining Software Engineering Data 80 • FLOSSmole – provides raw data about open source projects – provides summary reports about open source projects – integrates donated data from other research teams – provides tools so you can gather your own data • Data sources – Sourceforge – Freshmeat – Rubyforge – ObjectWeb – Free Software Foundation (FSF) – SourceKibitzer http://flossmole.org/
  • 81. A. E. Hassan and T. Xie: Mining Software Engineering Data 81
  • 82. A. E. Hassan and T. Xie: Mining Software Engineering Data 82 Yearly MSR Challenge Since 2006 • http://2016.msrconf.org/#/challenge (Boa data for SourceForge, GitHub) • http://2015.msrconf.org/challenge.php (Stackoverflow data) • http://2014.msrconf.org/challenge.php (GitHub data) • http://2013.msrconf.org/challenge.php (Stackoverflow data) • http://2012.msrconf.org/challenge.php (Change/bug report data for Android) • http://2011.msrconf.org/msr-challenge.html (Eclipse, Netbeans, Firefox, Chrome data) • http://msr.uwaterloo.ca/msr2010/challenge/ (FreeBSD, GNOME Debian/Ubuntu) • http://msr.uwaterloo.ca/msr2009/challenge/ (GNOME) • http://msr.uwaterloo.ca/msr2008/ (Eclipse) • http://msr.uwaterloo.ca/msr2007/challenge/ (Eclipse, Firefox) • http://msr.uwaterloo.ca/challenge/ (PostgreSQL, ArgoUML) • http://msr.uwaterloo.ca/msr2005/ • http://msr.uwaterloo.ca/msr2004/
  • 83. A. E. Hassan and T. Xie: Mining Software Engineering Data 83 • PROMISE repository – http://openscience.us/repo/ • TraceLab – http://www.coest.org/index.php/tracelab/about-tracelab • Apache SVN commits on Github – https://github.com/monperrus/apache-commits/ • Benchmarks for software maintenance tasks – http://www.cs.wm.edu/semeru/data/benchmarks/ • Non-functional requirements wordlists – http://softwareprocess.es/static/What%27s_in_a_Name.html • Source code ECOsystem Linked Data (SeCold) – http://www.secold.org/ • Text Analysis for Software Engineering Wiki – http://textse.wikispaces.com/Home
  • 84.
  • 85.
  • 86.
  • 87.
  • 88.
  • 89.
  • 90. Must show value before data quality improves Correlation vs. Causation
  • 91. A. E. Hassan and T. Xie: Mining Software Engineering Data 91 • Make sure you manually examine the repositories. Do not fully automate the process! Image from http://www.quincyma.gov/Government/OCS/neighborhoodtips.cfm
  • 92. A. E. Hassan and T. Xie: Mining Software Engineering Data 92 • Drop all transactions above a large threshold
  • 93. A. E. Hassan and T. Xie: Mining Software Engineering Data 93 • Few developers are given commit privileges • Actual developer is usually mentioned in the change message • One must study project commit policies before reaching any conclusions [German 2006]
  • 94. 94 APP DEVELOPERS APP USERS App Functional Requirements App Security Requirements User Functional Requirements User Security Requirements informal: app description, etc. permission list, etc. App Code Pandita, Xiao, Yang, Enck, Xie. WHYPER: Towards Automating Risk Assessment of Mobile Applications. USENIX SEC 2013. Tutorial Slides: http://www.slideshare.net/taoxiease/text-analytics-for-security
  • 95.
  • 96. • External collaborators: Ahmed Hassan, Dongmei Zhang, … • Students… • Broad colleagues in this area, … • Funding supports: