Paper: Expanding Identifiers to Normalize Source Code Vocabulary
Authors: Dave Binkley and Dawn Lawrie
Session: Research Track 4: Natural Language Analysis
1. EXPANDING IDENTIFIERS TO NORMALIZE SOURCE CODE VOCABULARY
PRESENTED BY DAWN LAWRIE
LOYOLA UNIVERSITY MARYLAND
IN COLLABORATION WITH DAVE BINKLEY
Friday, October 7, 11
2. VOCABULARY MISMATCH
DIFFERENT VOCABULARY IN SOURCE CODE AND OTHER
SOFTWARE ARTIFACTS
EXAMPLE
REQUIREMENT - “FEATURE LOCATION”
SOURCE CODE - “FEATURELOCATION”
OR WORSE “FLOC”
3. PURPOSE OF NORMALIZE
COPE WITH VOCABULARY MISMATCH BETWEEN
SOURCE CODE AND
OTHER SOFTWARE DOCUMENTS
4. EXAMPLE PROBLEMS
CONSIDER IDENTIFIERS
FEATURELOCATION -> FEATURE LOCATION (SPLITTING PROBLEM)
FLOC -> F LOC (SPLITTING PROBLEM)
FLOC -> FEATURE LOCATION (SPLITTING AND EXPANSION PROBLEM)
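
A small sketch of why FLOC needs both steps. The matching rule used here (same first letter, abbreviation letters appearing in order) and the tiny word list are assumptions made for illustration; the slides do not state how Normalize generates candidate expansions.

```python
def is_candidate(abbrev, word):
    """True if word starts with abbrev's first letter and contains
    all of abbrev's letters in the same order (assumed matching rule)."""
    if not abbrev or not word or abbrev[0] != word[0]:
        return False
    letters = iter(word)
    # `ch in letters` advances the iterator, so order is respected.
    return all(ch in letters for ch in abbrev)

# Tiny stand-in dictionary (hypothetical).
WORDS = ["feature", "location", "local", "lock", "flock", "file"]

def expansions(abbrev):
    """List the dictionary words that could expand the abbreviation."""
    return [w for w in WORDS if is_candidate(abbrev, w)]

print(expansions("f"))     # ['feature', 'flock', 'file']
print(expansions("loc"))   # ['location', 'local', 'lock']
print(expansions("floc"))  # ['flock']
```

Expanding the unsplit identifier only reaches FLOCK; FEATURE and LOCATION become reachable once FLOC is split into F and LOC.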
8. WHY NORMALIZE?
MANY SE PROBLEMS CAN BE ADDRESSED USING
INFORMATION RETRIEVAL (IR) TECHNIQUES
UN-NORMALIZED CODE LEADS TO AN UNDERESTIMATE
OF THE IMPORTANCE OF CRUCIAL WORDS
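
The underestimate can be illustrated with plain term counts, the raw material of most IR weighting schemes. The tokens and the hand-written expansion table below are hypothetical.

```python
from collections import Counter

def term_counts(tokens):
    """Count how often each lower-cased term occurs."""
    return Counter(t.lower() for t in tokens)

# Hypothetical identifiers from code that implements feature location.
raw_tokens = ["featurelocation", "floc", "floc", "feature"]

# Hand-written expansions standing in for the output of a normalizer.
expansion_table = {
    "featurelocation": ["feature", "location"],
    "floc": ["feature", "location"],
}
normalized_tokens = [w for t in raw_tokens for w in expansion_table.get(t, [t])]

raw = term_counts(raw_tokens)
norm = term_counts(normalized_tokens)

print(raw["location"], norm["location"])  # 0 3
print(raw["feature"], norm["feature"])    # 1 4
```

Before normalization, a query for LOCATION matches nothing in this code even though three of the four identifiers refer to it.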
9. NORMALIZE PROBLEM STATEMENT
FIND THE BEST EXPANSION OVER ALL POSSIBLE SPLITS
FLOC -> FEATURE LOCATION
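
To give a sense of the search space, the sketch below enumerates every way of cutting an identifier into contiguous pieces. It is an illustration only; the function name `all_splits` is hypothetical, and a practical splitter would return a short list of plausible splits rather than all of them.

```python
from itertools import combinations

def all_splits(identifier):
    """Return every way to cut the identifier into contiguous soft words."""
    n = len(identifier)
    splits = []
    for k in range(n):  # k = number of cut points, 0 .. n-1
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            splits.append([identifier[a:b] for a, b in zip(bounds, bounds[1:])])
    return splits

splits = all_splits("floc")
print(len(splits))           # 8, i.e. 2 ** (len("floc") - 1)
print(["f", "loc"] in splits)  # True
```

An identifier of n characters has 2^(n-1) splits, and each soft word in each split may have several expansions, which is why the best expansion has to be chosen by a scoring function rather than by inspection.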
32. NORMALIZE ALGORITHM PART I
EXPANDING STR: STRING VS STEER
COHESION A = STRING WITH LENDER + STRING WITH LENGTH
COHESION B = STEER WITH LENDER + STEER WITH LENGTH
1. FIND COHESION BY SUMMING LOG OF
PROBABILITIES OF WORD PAIRS
2. SELECT EXPANSION THAT MAXIMIZES
COHESION
RESULT: STRING
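
The two steps can be sketched as follows. The probabilities in `PAIR_PROB` and the `FLOOR` value for unseen pairs are made-up numbers for illustration, not values from the talk, and the function names are hypothetical.

```python
import math

# Hypothetical co-occurrence probabilities, for illustration only.
PAIR_PROB = {
    ("string", "lender"): 1e-7,
    ("string", "length"): 1e-3,
    ("steer", "lender"): 1e-7,
    ("steer", "length"): 1e-6,
}
FLOOR = 1e-9  # assumed probability for pairs never seen together

def pair_log_prob(a, b):
    """Log probability that words a and b co-occur (order-insensitive)."""
    p = PAIR_PROB.get((a, b)) or PAIR_PROB.get((b, a)) or FLOOR
    return math.log(p)

def cohesion(candidate, other_expansions):
    """Step 1: sum the log probabilities of the candidate paired with
    each expansion of the other soft words."""
    return sum(pair_log_prob(candidate, w) for w in other_expansions)

def best_expansion(candidates, other_expansions):
    """Step 2: select the candidate with the highest cohesion."""
    return max(candidates, key=lambda c: cohesion(c, other_expansions))

others = ["lender", "length"]  # expansions of the other soft word, LEN
print(best_expansion(["string", "steer"], others))  # string
```

Summing logs is equivalent to multiplying the probabilities, but avoids numeric underflow when many small values are combined.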
38. NORMALIZE ALGORITHM PART II
SPLIT STR-LEN -> STRING LENGTH
VS
SPLIT ST-RLEN -> STOP RIFLEMEN
1. FIND COHESION OVER EXPANSIONS
2. SELECT EXPANSION OF THE SPLIT
THAT MAXIMIZES COHESION
RESULT: STRING LENGTH
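
A minimal sketch of the comparison between splits, assuming each split already carries its chosen expansion. The probabilities, the `FLOOR` value, and the names `split_cohesion` and `expanded_splits` are hypothetical.

```python
import math
from itertools import combinations

# Hypothetical co-occurrence probabilities, for illustration only.
PAIR_PROB = {
    ("string", "length"): 1e-3,
    ("stop", "riflemen"): 1e-8,
}
FLOOR = 1e-9  # assumed probability for pairs never seen together

def split_cohesion(words):
    """Step 1: sum log probabilities over all pairs of expanded words."""
    total = 0.0
    for a, b in combinations(words, 2):
        p = PAIR_PROB.get((a, b)) or PAIR_PROB.get((b, a)) or FLOOR
        total += math.log(p)
    return total

# Each candidate split with the expansion selected for it.
expanded_splits = {
    ("str", "len"): ["string", "length"],
    ("st", "rlen"): ["stop", "riflemen"],
}

# Step 2: keep the split whose expansion is most cohesive.
best_split = max(expanded_splits, key=lambda s: split_cohesion(expanded_splits[s]))
print(best_split, expanded_splits[best_split])
# ('str', 'len') ['string', 'length']
```

One caveat of this simple sum: splits with fewer words contribute fewer (negative) terms, so a real implementation would need some way of comparing splits of different sizes fairly. The slides do not say how this is handled.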
43. ADDING CONTEXT
SOFT WORD: DIR
E(DIR) = {DIRECTION, DIRECTORY}
CONTEXT = {FORWARD, BACKWARD}
FIND COHESION WITH CONTEXT WORDS IN ADDITION TO
EXPANSIONS OF OTHER SOFT WORDS
USED IN BOTH PART I AND PART II
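
The DIR example can be sketched by adding the context words to the set of words a candidate is scored against. The probabilities are invented for illustration, and `cohesion_with_context` is a hypothetical name.

```python
import math

# Hypothetical co-occurrence probabilities, for illustration only.
PAIR_PROB = {
    ("direction", "forward"): 1e-4,
    ("direction", "backward"): 1e-4,
    ("directory", "forward"): 1e-7,
    ("directory", "backward"): 1e-7,
}
FLOOR = 1e-9  # assumed probability for pairs never seen together

def pair_log_prob(a, b):
    """Log probability that words a and b co-occur (order-insensitive)."""
    p = PAIR_PROB.get((a, b)) or PAIR_PROB.get((b, a)) or FLOOR
    return math.log(p)

def cohesion_with_context(candidate, other_expansions, context):
    """Score a candidate against expansions of the other soft words
    and against the context words."""
    words = list(other_expansions) + list(context)
    return sum(pair_log_prob(candidate, w) for w in words)

candidates = ["direction", "directory"]   # E(DIR)
context = ["forward", "backward"]
best = max(candidates, key=lambda c: cohesion_with_context(c, [], context))
print(best)  # direction
```

With no other soft words in the identifier, the context words are the only evidence available, which is the situation the DIR example describes.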
44. NORMALIZE IMPLEMENTATION
USES GenTest TO SPLIT IDENTIFIERS
GenTest RETURNS MULTIPLE SPLITS
USES THE GOOGLE 5-GRAM DATASET FOR CO-OCCURRENCE DATA
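
One plausible way to turn 5-gram counts into word-pair probabilities is sketched below. It assumes input lines of the form "w1 w2 w3 w4 w5<TAB>count" and counts two words as co-occurring when they appear in the same 5-gram; the slides do not specify how the probabilities are actually computed, and the sample lines are invented.

```python
from collections import Counter
from itertools import combinations

def pair_probabilities(ngram_lines):
    """Estimate co-occurrence probabilities of unordered word pairs
    from lines of the form 'w1 w2 w3 w4 w5<TAB>count'."""
    pair_counts = Counter()
    total = 0
    for line in ngram_lines:
        text, count = line.rsplit("\t", 1)
        words = sorted(set(text.lower().split()))
        for pair in combinations(words, 2):
            pair_counts[pair] += int(count)
            total += int(count)
    return {pair: c / total for pair, c in pair_counts.items()}

# Invented sample data, not taken from the real dataset.
sample = [
    "the string length of the\t120",
    "stop the riflemen at once\t3",
]
probs = pair_probabilities(sample)
print(probs[("length", "string")] > probs[("riflemen", "stop")])  # True
```

Pairs are stored in alphabetical order so that a lookup does not depend on the order in which the two words appeared.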
45. EVALUATION
Program       LoC       SLoC      Unique Ids
which-2.20    3,670     2,293     487
a2ps-4.14     62,347    38,436    4,393

Program       Selected Ids    Hard Words    Soft Words
which-2.20    487             903           1,214
a2ps-4.14     211             459           618
48. EVALUATION
THREE GROUPS OF IDENTIFIERS
STANDARD LIBRARY CALLS
NAMES FROM STANDARD HEADER FILES / KEYWORDS
DOMAIN NAMES
Program       Filtered Ids    Reported Ids
which-2.20    152             335
a2ps-4.14     46              166
49. EXAMPLE EXPANSIONS
id          Top 10 Expansion     Top Expansion
nextchar    next_character       next_character
indfound    index_found_need     index_found
optarg      option_are_g         optarg
itemno      i_them_not           itemno
50. RESEARCH QUESTIONS
WHAT IS THE OVERALL ACCURACY OF NORMALIZE?
DOES THE VOCABULARY USED HAVE A SIGNIFICANT
IMPACT ON THE EXPANSION’S ACCURACY?
CAN THE EXPANDER INFORM THE SPLITTER?
CAN THE SPLITTER INFORM THE EXPANDER?
54. FUTURE WORK
EXPLORING DIFFERENT SOURCES OF CO-OCCURRENCE
DATA
EXPLORING DIFFERENT WAYS OF CALCULATING
PROBABILITIES
EXAMINING NORMALIZATION IN CONTEXT OF AN
INFORMATION RETRIEVAL TASK
55. SUMMARY
IDENTIFIERS ARE WRITTEN DIFFERENTLY FROM OTHER
SOFTWARE DOCUMENTS
THIS DEGRADES PERFORMANCE OF IR TECHNIQUES
NORMALIZE CURRENTLY EXPANDS ABOUT HALF OF
SOFT WORDS CORRECTLY
56. QUESTIONS?
Need an identifier split?
GenTest Splitter available at
splitit.cs.loyola.edu