ICPC 2012 - Mining Source Code Descriptions

Mining Source Code Descriptions
from Developer Communications

Sebastiano Jairo Massimiliano Andrian Gerardo
Panichella Aponte Di Penta Marcus Canfora

Context: Software Project
describes

Sequence
diagram Program
Documentation Comprehension

Source Code Class
diagram
Maintenance
Tasks
Difficult
understanding understanding

Developer

Context: Software Project
Coming back
to the reality...
describes

Sequence
diagram Program
Documentation Comprehension

Source Code Class
diagram
Maintenance
Tasks
Difficult
understanding understanding

Developer

..................................................
When call the method IndexSplitter.split(File
Idea destDir, String[] segs) from the Lucene cotrib
directory(contrib/misc/src/java/org/apache/luc
ene/index) it creates an index with segments
We argue that messages exchanged wrong data. Namely wrong
descriptor file with among contributors/developers
are a useful source of information to help understanding of segment
is the number representing the name source code.
that would be created next in this index.
..................................................
In such situations developers need to infer knowledge from,
CLASS: IndexSplitter METHOD: split
source code descriptions in external
the source code itself
artifacts.

Developer

A Five Step-Approach for Mining
Method Descriptions

Developer

Step 1: Downloading emails/bugs reports and tracing them
onto classes
The discussion contains a fully-qualified class name (e.g.,
org.apache.lucene.analysis.MappingCharFilter); or the email contains a
Two heuristics file name (e.g., MappingCharFilter.java)

For bug reports, we complement the above heuristic by matching the
bug ID of each closed bug to the commit notes, therefore tracing the bug
report to the files changed in that commit
Developer
Discussion
When call the method IndexSplitter .split(File destDir, String[] segs) from the
Lucene cotrib directory (contrib/misc/src/java/org/apache/lucene/index) it creates
an index with segments descriptor file with wrong data. Namely wrong is the number
representing the name of segment that would be created next in this index.

CLASS: IndexSplitter
public void split(File destDir, String[] segs) throws IOException {
destDir.mkdirs();
FSDirectory destFSDir = FSDirectory.open(destDir);
SegmentInfos destInfos = new SegmentInfos }

If some of the segments of the index already has this name this results either to
impossibility to create new segment or in crating of an corrupted segment.

Step 2: Extracting paragraphs
We consider as paragraphs, text section separated by one or more
white lines
Two heuristics

We prune out paragraph description from source code fragments
and/or stack Traces "by using an approach inspired by the work of
Bacchelli et al.
Developer
Discussion
When call the method IndexSplitter.split(File destDir, String[] segs) from the PAR
Lucene cotrib directory (contrib/misc/src/java/org/apache/lucene/index) it creates
an index with segments descriptor file with wrong data. Namely wrong is the number
1
representing the name of segment that would be created next in this index.

public void split(File destDir, String[] segs) throws IOException {
destDir.mkdirs();
PAR
FSDirectory destFSDir = FSDirectory.open(destDir); 2
SegmentInfos destInfos = new SegmentInfos }

If some of the segments of the index already has this name this results either to
PAR
impossibility to create new segment or in crating of an corrupted segment. 3

Step 3: Tracing paragraphs onto methods
A) A valid paragraph must contain the keyword “method”
These paragraphs must
respect the following
B) and the method name must be followed by a open parenthesis—
two conditions:
i.e., we match “foo(”

Developer
Discussion A) B)
When call the method IndexSplitter.split(File destDir, String[] segs)
PAR
from the Lucene cotrib directory it creates an index with segments1
descriptor file with wrong data. Namely wrong is the number
representing the name of segment that would be created next in this
index.

CLASS: IndexSplitter
METHOD: split(
......................................................................................
......................................................................................
......................................................................................
......................................................................................

Step 4: Heuristic based Filtering

We defined a set of heuristics to further filter the paragraphs associated with
methods that assign each paragraph a score:

.......................... a) Method parameters:
Problem seems to come from
1 if the
% of parameters method
MainMethodeSearchEngine in
org.eclipse.jdt.internal.ui.launcher s1= mentioned in the does not
The Method paragraphs. Value have
searchMainMethods(IProgressMonitor, between 0 and 1 parameters
IJavaSearchScope, boolean) ,there's
a call to addSubTypes(List,
IProgressMonitor, IJavaSearchScope)
Method if includesSubtypes flag is
ON. This method add all types sub-
types as soon as the given scope
encloses them without testing if
sub-types have a main! After return
IType[] before the excecution
..........................
CLASS: MainMethodSearchEngine
METHOD: serachMainMethods
% parameter = 100% -> s1= 1
SCORE



1 if the
a call to addSubTypes(List, b) Syntactic descriptions (mentioning return values):
Method if includesSubtypes flag is check whether the Equal to one if
ON. This method add all types sub- paragraph contains the the method is
types as soon as the given scope s2= keyword “return”. If YES
void.
sub-types have a main! After return Value equal 1, 0 otherwise
IType[] before the excecution
..........................
CLASS: MainMethodSearchEngine
METHOD: serachMainMethods
% parameter = 100% -> s1= 1
SCORE = 1+



1 if the
a call to addSubTypes(List, b) Syntactic descriptions (mentioning return values):
Method if includesSubtypes flag is check whether the Equal to one if
ON. This method add all types sub- paragraph contains the the method is
types as soon as the given scope s2= keyword “return”. If YES
void.
sub-types have a main! After return Value equal 1, 0 otherwise
IType[] before the excecution c) Overriding/Overloading:
.......................... 1 if any of the “overload” or
CLASS: MainMethodSearchEngine s3=“override” keywords appears
METHOD: serachMainMethods in the paragraph, 0 otherwise
% parameter = 100% -> s1= 1 d) Method invocations:
SCORE = 1+ 0+ 1 = 2 1 if any of the “call” or
s4=“excecute” keywords appears
in the paragraph, 0 otherwise



a) Method parameters:
We selected paragraphs that have: % of parameters
1 if the
method
s1= mentioned in the does not
1. s1 ≥ thP = 0.5 paragraphs. Value have
between 0 and 1 parameters
2. s2 + s3 + s4 ≥ thH = 1 b) Syntactic descriptions (mentioning return values):
check whether the Equal to one if
paragraph contains the the method is
s2= keyword “return”. If YES void.
Value equal 1, 0 otherwise
OK c) Overriding/Overloading:
1 if any of the “overload” or
s3=“override” keywords appears
% parameter = 100% -> s1= 1 ≥ 0.5 d) Method invocations:
SCORE = 1+ 0+ 1 = 2 ≥ 1 1 if any of the “call” or
s4=“excecute” keywords appears

Step 5: Similarity based Filtering

We rank filtered paragraphs through their textual similarity with the method they
are likely describing.

METHOD PARAGRAPH SCORE Similarity Removing:
- English stop words;
Method_3 Paragraph_4 2.5 96.1%
- Programming
Method_1 Paragraph_1 2.5 95.6% language keywords
Method_2 Paragraph_2 1.5 97.4% Using:
Method_3 Paragraph_3 1.5 86.2% - Camel Case
splitting the on
remaining words
- Vector Space Model
Method_3 Paragraph_1 1.3 73.9% th>=0.80

Empirical Study
• Goal: analyze source code descriptions in developer
discussions

• Purpose: investigating how developer discussions
describe methods of Java Source Code

• Quality focus: find good method description in
messages exchanged among contributors/developers

• Context: Bug-report and mailing lists of two Java
Project
 Apache Lucene and Eclipse

Research Questions

 RQ1 (method coverage): How many methods from
the analyzed software systems are described by the
paragraphs identified by the proposed approach?

 RQ2 (precision): How precise is the proposed approach
in identifying method descriptions?

 RQ3 (missing descriptions): How many potentially
good method descriptions are missed by the approach?

RQ1: How many methods from the analyzed software
systems are described by the paragraphs identified
by the proposed approach?

RQ2: How precise is the proposed approach in identifying
method descriptions?

We sampled 250 descriptions from each project

RQ3: How many potentially good method descriptions are
missed by the approach?
We sampled 100 descriptions from each project
TABLE III
The analysis of a sample of 100 paragraphs traced to methods,
but not satisfying the Step 4 heuristic

System True Negatives False Negatives
Eclipse 78 22
Apache Lucene 67 33

ICPC 2012 - Mining Source Code Descriptions

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (19)

Andere mochten auch

Andere mochten auch (20)

Ähnlich wie ICPC 2012 - Mining Source Code Descriptions

Ähnlich wie ICPC 2012 - Mining Source Code Descriptions (20)

Mehr von Sebastiano Panichella

Mehr von Sebastiano Panichella (20)

ICPC 2012 - Mining Source Code Descriptions