LINSEN: An Efficient Approach to Split Identifiers and Expand Abbreviations
Anna Corazza, Sergio Di Martino, Valerio Maggio
Università di Napoli “Federico II”
26th Sept. 2012, ICSM2012@Riva del Garda(Trento), Italy

IR FOR NATURAL LANGUAGE
1. Tokenization
Draws, the, are,
NullHandle,
box, r,
Rectangle, g,
Graphics,
box,
displayBox,
...

2. Normalization
1. Change to Lower case
draws, the, are,
nullhandle,
box, r,
rectangle, g,
graphics,
box,
displaybox,
...

2. Remove stop words

3. Apply Stemming
draw, the, are,
nullhandl,
box, r,
rectangl, g,
graphic,
box,
displaybox,
4. ...

Implicit assumption: The “same” words are used whenever a
particular concept is described
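The normalization steps listed above can be sketched in a few lines of Python. This is only an illustration: the stop-word list is a tiny made-up sample, and `toy_stem` is a hypothetical stand-in for a real stemmer (the stems on the slides look like the output of a Porter-style stemmer, which is not reimplemented here).

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer)
STOP_WORDS = {"the", "are", "a", "an", "of", "is"}

def tokenize(text):
    """Step 1: split the text into alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", text)

def toy_stem(token):
    """Toy stand-in for a stemmer: drop a plural 's', then a final 'e'."""
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        token = token[:-1]
    if token.endswith("e") and len(token) > 3:
        token = token[:-1]
    return token

def normalize(tokens):
    """Step 2: lower case, remove stop words, apply stemming."""
    lowered = [t.lower() for t in tokens]
    kept = [t for t in lowered if t not in STOP_WORDS]
    return [toy_stem(t) for t in kept]

tokens = tokenize("Draws the NullHandle box: Rectangle r, Graphics g, displayBox")
print(normalize(tokens))
# ['draw', 'nullhandl', 'box', 'rectangl', 'r', 'graphic', 'g', 'displaybox']
```

Note that `displaybox` and `nullhandl` stay glued together: nothing in this pipeline knows that they contain `box` or `handle`.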

IR FOR SOURCE CODE
1. Tokenization
1.5 Identifier Splitting
2. Normalization
• snake_case Splitter: r'(?<=\w)_'
• display_box ==> display | box
• camelCase/PascalCase Splitter: r'(?<!^)([A-Z][a-z]+)'
• displayBox ==> display | Box
draw, the, are,
null, handl,
box, r,
rectangl, g,
graphic,
box,
display, box,
...

IR FOR SOURCE CODE
• camelCase Splitter: r'(?<!^)([A-Z][a-z]+)'
• drawXORRect ==> drawXOR | Rect
• drawxorrect ==> NO SPLIT
Splitting algorithms based on naming conventions are not robust enough
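The two convention-based splitters and the cases where they fall short can be tried directly with Python's `re` module. The helper name `split_identifier` is made up for this sketch; the two patterns are the ones from the slides.

```python
import re

SNAKE = re.compile(r'(?<=\w)_')                # underscore preceded by a word character
CAMEL = re.compile(r'(?<!^)([A-Z][a-z]+)')     # capitalized word not at the very start

def split_identifier(identifier):
    """Split on underscores first, then before each capitalized word."""
    parts = []
    for chunk in SNAKE.split(identifier):
        # Put a space in front of every capitalized word, then split on spaces
        marked = CAMEL.sub(r' \1', chunk)
        parts.extend(marked.split())
    return parts

print(split_identifier("display_box"))   # ['display', 'box']
print(split_identifier("displayBox"))    # ['display', 'Box']
print(split_identifier("drawXORRect"))   # ['drawXOR', 'Rect']
print(split_identifier("drawxorrect"))   # ['drawxorrect']
```

The last two calls show the limitation: the acronym `XOR` is never separated from `draw`, and an identifier without any case or underscore hints is not split at all.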

IR FOR SOURCE CODE
Splitting algorithms based on naming conventions
• Heavy use of abbreviations in the source code
• rect stands for Rectangle
• r stands for Rectangle

IDENTIFIER MAPPING
1. Tokenization
1.5 Identifier Mapping
2. Normalization
• SAMURAI (Enslen et al., 2011)
• TIDIER (Guerrouj et al., 2011)
• GenTest+Normalize (Lawrie and Binkley, 2011)
• AMAP (Hill and Pollock, 2008)
• ...
• LINSEN
draw, the, are,
null, handl,
box, r,
rectangl, g,
graphic,
box,
display, box,
...

LINSEN ALGORITHM
CONTRIBUTION
• Novel technique for Identifier Mapping
• Based on an efficient String Matching technique: the Baeza-Yates & Perleberg Algorithm (BYP)
• Applied on a Graph-based model
• Able to both split identifiers and expand any abbreviations that occur in them

THE AMBIGUITY PROBLEM
There could be multiple and equally correct splitting or expansion solutions
• r stands for Rectangle OR red
• nsISupports ==> ns|IS|up|ports OR
  ==> ns|I|Supports

DICTIONARIES
Application-aware
Dictionaries
(22,940 Entries)
(108,315 Entries)
(588 Entries)

GRAPH MODEL
Model: Weighted Directed Graph. Example: drawXORRect identifier
• NODES correspond to characters of the current identifier
[Figure: graph for drawXORRect with nodes numbered 0 to 11]

• ARCS correspond to matchings between identifier substrings and dictionary words
[Figure: the same graph with arcs labelled “raw”,c(“raw”); “draw”,c(“draw”); “or”,c(“or”); “xor”,c(“xor”); “rectangle”,c(“rectangle”)]

• Application of the String Matching Algorithm (BYP)
• Padding Arcs ensure that the Graph is always connected
[Figure: the same graph with one padding arc per character (“d”, “r”, “a”, “w”, “X”, “O”, “R”, “R”, “e”, “c”, “t”), each with cost C-MAX]

• Every Arc is labelled with the corresponding dictionary word
• Weights represent the “cost” of each matching
• The cost function c(“word”) favors longer words and words coming from the application-aware dictionaries

• The final Mapping Solution corresponds to the sequence of labels along the path with the minimum cost (Dijkstra's Algorithm)
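A minimal sketch of the graph model, under stated assumptions: the value of `C_MAX`, the shape of `cost`, and the `app_aware` flags below are invented for illustration (the slides only say that the cost favors longer words and application-aware dictionaries), and the list of matches is given by hand instead of being produced by BYP.

```python
import heapq

C_MAX = 100.0  # assumed cost of a single-character padding arc

def cost(word, app_aware=False):
    """Hypothetical cost: longer words and application-aware words are cheaper."""
    return (10.0 / len(word)) * (0.5 if app_aware else 1.0)

def build_graph(identifier, matches):
    """Nodes 0..len(identifier); matches are (start, end, word, app_aware) tuples."""
    arcs = {node: [] for node in range(len(identifier) + 1)}
    for i, ch in enumerate(identifier):
        arcs[i].append((i + 1, ch, C_MAX))  # padding arc keeps the graph connected
    for start, end, word, app_aware in matches:
        arcs[start].append((end, word, cost(word, app_aware)))
    return arcs

def best_mapping(identifier, matches):
    """Dijkstra from node 0 to the last node; returns the labels on the cheapest path."""
    arcs = build_graph(identifier, matches)
    target = len(identifier)
    dist, back, heap = {0: 0.0}, {}, [(0.0, 0)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist[node]:
            continue  # stale heap entry
        for nxt, label, weight in arcs[node]:
            nd = d + weight
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                back[nxt] = (node, label)
                heapq.heappush(heap, (nd, nxt))
    labels, node = [], target
    while node != 0:
        node, label = back[node]
        labels.append(label)
    return labels[::-1]

# Arcs taken from the drawXORRect example on the slides
matches = [(0, 4, "draw", True), (1, 4, "raw", False), (4, 7, "xor", False),
           (5, 7, "or", False), (7, 11, "rectangle", True)]
print(best_mapping("drawXORRect", matches))  # ['draw', 'xor', 'rectangle']
```

With no dictionary matches at all, only padding arcs remain and the path degenerates to the individual characters, which is why the padding cost is kept much higher than any word cost.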

STRING MATCHING
• Application of the Baeza-Yates and Perleberg (BYP) Algorithm
• Signature: BYP(identifier, word, φ(word))
• identifier: target string
• word: string to match
• φ(·): Tolerance (Error) function
• Bounds the length of acceptable matchings
Advantage: the same algorithm serves both the splitting and the expansion step, each with a different Tolerance function as input
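The signature above can be illustrated with a simplified, counting-based matcher in the spirit of BYP: for every start position it counts how many characters of the word line up with the identifier, and accepts positions with at most `tolerance` mismatches. This is a sketch, not the authors' implementation; in particular, passing the tolerance as a plain integer and the names `phi_split` and `phi_exp` are assumptions made here.

```python
from collections import defaultdict

def byp(identifier, word, tolerance):
    """Start offsets where `word` matches `identifier` with at most `tolerance` mismatches."""
    text, pattern = identifier.lower(), word.lower()
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    # For each character, the offsets at which it occurs in the pattern
    offsets = defaultdict(list)
    for j, ch in enumerate(pattern):
        offsets[ch].append(j)
    # counts[s] = number of pattern characters that agree when aligned at start s
    counts = [0] * (len(text) - m + 1)
    for i, ch in enumerate(text):
        for j in offsets.get(ch, ()):
            start = i - j
            if 0 <= start < len(counts):
                counts[start] += 1
    return [s for s, c in enumerate(counts) if c >= m - tolerance]

def phi_split(word):
    """Splitting: exact matching, no errors allowed."""
    return 0

def phi_exp(word):
    """Expansion: hypothetical tolerance of one mismatch."""
    return 1

print(byp("drawXORRect", "xor", phi_split("xor")))  # [4]
print(byp("drawXORRect", "red", phi_split("red")))  # []
print(byp("drawXORRect", "red", phi_exp("red")))    # [7]
```

Only the tolerance changes between the last two calls, which is the advantage the slide points out.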

BYP FOR SPLITTING
• BYP(identifier, word, φSplit(word))
• φSplit: Exact Matching (i.e., no errors allowed)
Identifier: drawXORRect
DFile: ..., draw, the, are, null, handle, box, red, rectangle, ...
DCOMPUTER-SCIENCE: ..., echo, testing, threading, xpm, xor, ...
DEnglish: ..., abort, absolute, abstract, ..., or, ..., raw, ...
[Figure: the drawXORRect graph (nodes 0 to 11) with its C-MAX padding arcs; exact matching against the dictionaries adds arcs such as “draw”,c(“draw”) and “xor”,c(“xor”)]
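The splitting step can be mimicked by looking up every dictionary word in the identifier and recording an arc for each exact occurrence. In this sketch plain `str.find` stands in for BYP with zero tolerance, the dictionaries contain only the sample entries visible on the slides, and the function name `exact_arcs` is made up.

```python
def exact_arcs(identifier, dictionaries):
    """Return (start, end, word, dictionary_name) for every exact occurrence."""
    text = identifier.lower()
    arcs = []
    for name, words in dictionaries:
        for word in words:
            start = text.find(word)
            while start != -1:
                arcs.append((start, start + len(word), word, name))
                start = text.find(word, start + 1)  # look for further occurrences
    return arcs

# Sample entries only; the real dictionaries are far larger
dictionaries = [
    ("DFile", ["draw", "the", "are", "null", "handle", "box", "red", "rectangle"]),
    ("DCOMPUTER-SCIENCE", ["echo", "testing", "threading", "xpm", "xor"]),
    ("DEnglish", ["abort", "absolute", "abstract", "or", "raw"]),
]

for arc in exact_arcs("drawXORRect", dictionaries):
    print(arc)
# (0, 4, 'draw', 'DFile')
# (4, 7, 'xor', 'DCOMPUTER-SCIENCE')
# (5, 7, 'or', 'DEnglish')
# (1, 4, 'raw', 'DEnglish')
```

Keeping the dictionary name on each arc is one way to let a cost function prefer words from the application-aware dictionaries.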

BYP FOR EXPANSION
• BYP(identifier, word, φExp(word))
• φExp: Approximate Matching
Identifier: drawXORRect
Dictionaries: DFile, DCOMPUTER-SCIENCE, DEnglish (same entries as in the splitting step)
[Figure: the drawXORRect graph (nodes 0 to 11) with its C-MAX padding arcs; approximate matching adds the arc “red”,c(“red”), where red is an entry of DFile]

EMPIRICAL EVALUATION
RESEARCH QUESTIONS
• RQ1:
How does LINSEN compare with state-of-the-art approaches in splitting identifiers?
• RQ2:
How does LINSEN compare with state-of-the-art approaches in mapping identifiers to dictionary words?
• RQ3:
How well does the LINSEN approach deal with different types of abbreviations?

CASE STUDIES
• DTW (Madani et al., 2010): RQ1 and RQ2
• GenTest+Normalize (Lawrie and Binkley, 2011): RQ1 and RQ2
• LUDISO Dataset (2012): RQ1 only
  15 out of 750 software systems, covering 58% of total identifiers
• AMAP (Hill and Pollock, 2008): RQ3 only

EVALUATION METRICS
Comparability of Results: Accuracy rate
• Identifier Level evaluation: each mapping result must be completely correct
• Soft-word Level evaluation: “partial credit” given to each word correctly mapped
Both levels are used for the comparison with GenTest+Normalize (Lawrie and Binkley, 2011)
Qualitative Evaluation: Precision/Recall/F-1 [Guerrouj et al., 2011]
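The difference between the two accuracy levels is easiest to see on a small example. The definitions below are one plausible reading of the slide (all-or-nothing per identifier versus credit per correctly mapped word); the function names and the sample oracle are made up, and the exact formulas used in the evaluation may differ.

```python
def identifier_level_accuracy(predicted, oracle):
    """Fraction of identifiers whose whole mapping equals the oracle mapping."""
    correct = sum(1 for name in oracle if predicted.get(name) == oracle[name])
    return correct / len(oracle)

def soft_word_level_accuracy(predicted, oracle):
    """Fraction of oracle soft words that also appear in the predicted mapping."""
    total = sum(len(words) for words in oracle.values())
    hits = 0
    for name, words in oracle.items():
        remaining = list(predicted.get(name, []))
        for word in words:
            if word in remaining:
                remaining.remove(word)  # each predicted word earns credit once
                hits += 1
    return hits / total

# Hypothetical oracle and tool output
oracle = {"drawXORRect": ["draw", "xor", "rectangle"],
          "displayBox": ["display", "box"]}
predicted = {"drawXORRect": ["draw", "xor", "rect"],
             "displayBox": ["display", "box"]}

print(identifier_level_accuracy(predicted, oracle))  # 0.5
print(soft_word_level_accuracy(predicted, oracle))   # 0.8
```

A single unexpanded abbreviation costs a whole identifier at the identifier level, but only one word out of five at the soft-word level.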

RESULTS
RQ1: SPLITTING
[Figure: bar chart of accuracy rates (axis 0 to 1) for DTW and LINSEN on JHotDraw 5.1 and Lynx 2.8.5]

RESULTS
RQ1: SPLITTING
[Figure: bar charts comparing GenTest and LINSEN on which 2.20 and a2ps 4.14; Identifier Level panel (axis 0 to 0.7) and Soft-word Level panel (axis 0 to 0.8)]

RESULTS
RQ2: MAPPING
[Figure: bar chart (axis 0 to 1) comparing DTW and LINSEN on JHotDraw 5.1 and Lynx 2.8.5]

RESULTS
RQ2: MAPPING
[Figure: bar charts comparing Normalize and LINSEN; Identifier Level panel (axis 0 to 0.6) and Soft-word Level panel (axis 0 to 0.9)]

RESULTS
RQ3: EXPANSION
[Figure: bar chart (axis 0 to 0.9) comparing AMAP (Hill and Pollock, 2008) and LINSEN for each abbreviation type: CW, DL, OO, AC, PR, SL]
CW: Combination Words
DL: Dropped Letters
OO: Others
AC: Acronyms
PR: Prefix
SL: Single Letters

FUTURE WORK
• Evaluate the impact of each adopted dictionary on the performance
• Improve, change, or add dictionaries
• Improve the implementation of the prototype to speed up the computation
• Use parallel computation to process each identifier in isolation

THANK YOU
26th Sept. 2012, ICSM2012@Riva del Garda(Trento), Italy
Valerio Maggio
Ph.D. Student, University of Naples “Federico II”
valerio.maggio@unina.it
