"Linsen an efficient approach to split identifiers and expand abbreviations"
Slides presented at the International Conference of Software Maintenance (ICSM) 2012, Riva del Garda (TN), Italy
LINSEN an efficient approach to split identifiers and expand abbreviations
1. LINSEN
AN EFFICIENT APPROACH
TO SPLIT IDENTIFIERS AND
EXPAND ABBREVIATIONS
Anna Corazza, Sergio Di Martino, Valerio Maggio
Università di Napoli “Federico II”
26th Sept. 2012, ICSM2012@Riva del Garda(Trento), Italy
17. • camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’
• drawXORRect ==> drawXOR | Rect
• drawxorrect ==> NO SPLIT
IRFORSOURCECODE
18. • camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’
• drawXORRect ==> drawXOR | Rect
• drawxorrect ==> NO SPLIT
IRFORSOURCECODE Splitting algorithms based on naming conventions
are not robust enough
19. Splitting algorithms based on naming conventions
are not robust enough
• Heavy use of Abbreviations in the source code
IRFORSOURCECODE
20. Splitting algorithms based on naming conventions
are not robust enough
• Heavy use of Abbreviations in the source code
• rect as for Rectangle
• r as for Rectangle
IRFORSOURCECODE
21. Splitting algorithms based on naming conventions
are not robust enough
• Heavy use of Abbreviations in the source code
• rect as for Rectangle
• r as for Rectangle
IRFORSOURCECODE
23. LINSEN
A L G O
R I T H M CONTRIBUTION
• Novel technique for the Identifier Mapping
24. LINSEN
A L G O
R I T H M CONTRIBUTION
• Novel technique for the Identifier Mapping
• Based on an efficient String Matching technique:
Baeza-Yates&Perlberg Algorithm (BYP)
25. LINSEN
A L G O
R I T H M CONTRIBUTION
• Novel technique for the Identifier Mapping
• Based on an efficient String Matching technique:
Baeza-Yates&Perlberg Algorithm (BYP)
• Applied on a Graph-based model
26. LINSEN
A L G O
R I T H M CONTRIBUTION
• Novel technique for the Identifier Mapping
• Based on an efficient String Matching technique:
Baeza-Yates&Perlberg Algorithm (BYP)
• Applied on a Graph-based model
• Able to both Split Identifiers and Expand possible occurring
abbreviations
27. There could be multiple and equally correct splitting or
expansion solutions
THEAMBIGUITYPROBLEM
28. There could be multiple and equally correct splitting or
expansion solutions
• r as for Rectangle OR red
THEAMBIGUITYPROBLEM
29. There could be multiple and equally correct splitting or
expansion solutions
• r as for Rectangle OR red
THEAMBIGUITYPROBLEM
•nsISupport ==> ns|IS|up|ports OR
==> ns|I|Supports
37. GRAPHMODEL
• NODES correspond to characters of the current identifier
Model: Weighted Directed Graph Example: drawXORRect identifier
0
11
1
9
4
5
7
8
2 63
10
38. GRAPHMODEL
• ARCS corresponds to matchings between identifier substrings
and dictionary words
• NODES correspond to characters of the current identifier
Model: Weighted Directed Graph Example: drawXORRect identifier
0
11
1
9
4
5
7
8
2 63
10
“raw”,c(“raw”)
“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“rectangle”,c(“rectangle”)
39. GRAPHMODEL
• ARCS corresponds to matchings between identifier substrings
and dictionary words
• Application of the String Matching Algorithm (BYP)
• NODES correspond to characters of the current identifier
Model: Weighted Directed Graph Example: drawXORRect identifier
0
11
1
9
4
5
7
8
2 63
10
“raw”,c(“raw”)
“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“rectangle”,c(“rectangle”)
40. GRAPHMODEL
• ARCS corresponds to matchings between identifier substrings
and dictionary words
• Application of the String Matching Algorithm (BYP)
• Padding Arcs to ensure the Graph always connected
• NODES correspond to characters of the current identifier
Model: Weighted Directed Graph Example: drawXORRect identifier
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,C-MAX
6
“a”,C-MAX
3
“w”,C-MAX
“X”,C-MAX
“O”,C-MAX
“e”,C-MAX
10
“c”,C-MAX“t”,C-MAX
“raw”,c(“raw”)
“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“rectangle”,c(“rectangle”)
“R”,C-MAX
41. GRAPHMODEL
• Every Arc is Labelled with the corresponding dictionary word
• Weights represent the “cost” of each matching
• Cost function [c(“word”)] favors longest words and words coming
from the application-aware dictionaries
Model: Weighted Directed Graph Example: drawXORRect identifier
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,C-MAX
6
“a”,C-MAX
3
“w”,C-MAX
“X”,C-MAX
“O”,C-MAX
“e”,C-MAX
10
“c”,C-MAX“t”,C-MAX
“raw”,c(“raw”)
“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“rectangle”,c(“rectangle”)
“R”,C-MAX
42. GRAPHMODEL
• The final Mapping Solution corresponds to the sequence of labels in the
path with the minimum cost (Djikstra Algorithm)
Model: Weighted Directed Graph Example: drawXORRect identifier
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,C-MAX
6
“a”,C-MAX
3
“w”,C-MAX
“X”,C-MAX
“O”,C-MAX
“e”,C-MAX
10
“c”,C-MAX“t”,C-MAX
“raw”,c(“raw”)
“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“rectangle”,c(“rectangle”)
“R”,C-MAX
43. STRING MATCHING
• Application of the Baeza-Yates and Perlberg (BYP) Algorithm
• Signature: BYP(identifier, word, φ(word))
• identifier: target string
• word: string to match
• φ(·): Tolerance (Error) function
• Bounds the length of acceptable matchings
Advantage: Use the same algorithm for both the splitting and
the expansion step with different input Tolerance function
57. EMPIRI
CAL
E V A L U
A T I O N
RESEARCH
QUESTIONS
• RQ1:
How does LINSEN compare with state-of-the-art approaches as for the splitting of
identifiers?
58. EMPIRI
CAL
E V A L U
A T I O N
RESEARCH
QUESTIONS
• RQ1:
How does LINSEN compare with state-of-the-art approaches as for the splitting of
identifiers?
• RQ2:
How does LINSEN compare with state-of-the-art approaches as for the mapping
of identifiers to dictionary words?
59. EMPIRI
CAL
E V A L U
A T I O N
RESEARCH
QUESTIONS
• RQ1:
How does LINSEN compare with state-of-the-art approaches as for the splitting of
identifiers?
• RQ2:
How does LINSEN compare with state-of-the-art approaches as for the mapping
of identifiers to dictionary words?
• RQ3:
What is the ability of the LINSEN approach in dealing with different types of
abbreviations?
60. CASE STUDIES
DTW (Madani et. al 2010)
GenTest+Normalize (Lawrie and Binkley, 2011)
RQ1 and
RQ2}
RQ1 only
LUDISO Dataset
(2012)
AMAP (Hill and Pollock, 2008)
RQ3 only
15 out of 750 software systems
Covering the 58% of total identifiers
EMPIRI
CAL
E V A L U
A T I O N
61. EVALUATION METRICS
Comparability of Results:
Accuracy rate
Qualitative Evaluation:
Precision/Recall/F-1
[Guerrouj, et.al , 2011]
EMPIRI
CAL
E V A L U
A T I O N
62. EVALUATION METRICS
Comparability of Results:
Accuracy rate
Qualitative Evaluation:
Precision/Recall/F-1
[Guerrouj, et.al , 2011]
EMPIRI
CAL
E V A L U
A T I O N
63. EVALUATION METRICS
Comparability of Results:
Accuracy rate
• Identifier Level evaluation:
Each mapping result must be completely
correct
• Soft-word Level evaluation:
“Partial credit” given to each word correctly
mapped
Qualitative Evaluation:
Precision/Recall/F-1
[Guerrouj, et.al , 2011]
As for the comparison with
GenTest+Normalize
(Lawrie and Binkley, 2011)
EMPIRI
CAL
E V A L U
A T I O N
64. RQ1: SPLITTING
Accuracy Rates for the
comparison with
DTW (Madani et. al 2010)
0
0.25
0.5
0.75
1
JhotDraw 5.1 Lynx 2.8.5
DTW LINSEN DTW LINSEN
DTW
LINSEN
RE
SULTS
R Q 1
65. Accuracy Rates for the
comparison with GenTest
(Lawrie and Binkley, 2011)
0
0.175
0.35
0.525
0.7
which 2.20 a2ps 4.14
Identifier Level
0
0.2
0.4
0.6
0.8
which 2.20 a2ps 4.14
Soft-word Level
GenTest
LINSEN
GenTest
LINSEN
RE
SULTS
R Q 1 RQ1: SPLITTING
66. Accuracy Rates for the
comparison with
DTW (Madani et. al 2010)
0
0.25
0.5
0.75
1
JhotDraw 5.1 Lynx 2.8.5
DTW
LINSEN
RQ2: MAPPING
RE
SULTS
R Q 2
67. Accuracy Rates for the
comparison with Normalize
(Lawrie and Binkley, 2011)
0
0.15
0.3
0.45
0.6
which 2.20 a2ps 4.14
Identifier Level
0
0.225
0.45
0.675
0.9
which 2.20 a2ps 4.14
Soft-word Level
Normalize
LINSEN
Normalize
LINSEN
RE
SULTS
R Q 2 RQ2: MAPPING
68. Accuracy Rates for the
comparison with
AMAP (Hill and Pollock, 2008)
0
0.225
0.45
0.675
0.9
CW DL OO AC PR SL
AMAP
LINSEN
RQ3: EXPANSION
RE
SULTS
R Q 3
CW: Combination Words
DL: Dropped Letters
OO: Others
AC: Acronyms
PR: Prefix
SL: Single Letters
74. FUTURE WORKS
• Evaluation of the impact of each adopted dictionary on
the performance
• Improve or change or add dictionaries
• Improve the implementation of the prototype to speed up
the computation
• Make use of parallel computation to process each
identifier in isolation
75. THANK YOU
26th Sept. 2012, ICSM2012@Riva del Garda(Trento), Italy
Valerio Maggio
Ph.D. Student, University of Naples “Federico II”
valerio.maggio@unina.it