Presented at: The 1st International Workshop on Natural Language-based Software Engineering (NLBSE '22)
Date of Conference: May 2022
Conference Location: Virtual
The preprint is available at: https://www.peruma.me/publication/2022-nlbse-digits/2022-nlbse-digits.pdf
A video of the presentation is available at: https://youtu.be/ERD6GTFzOxY
Understanding Digits in Identifier Names: An Exploratory Study
1. The 1st International Workshop on Natural Language-based Software Engineering (NLBSE ‘22)
Understanding Digits in Identifier Names
An Exploratory Study
Anthony Peruma and Christian D. Newman
Source Code Analysis and Natural Language Lab
2. SUMMARY
We explore the presence and purpose of digits in identifier names through an empirical study of 800 open-source Java systems.
3. BACKGROUND
• Identifier names help developers understand the purpose of the identifier
• Names must be unambiguous and intent-revealing in communicating the purpose and behavior of the code
• Developers can craft names using a variety of terms, making name consistency challenging
• Prior studies focused on the words that make up identifiers (such as abbreviations, acronyms, and naming styles), not on digits
4. OUR GOAL
Understand the part played by digits in
identifier names by examining the
structure of names containing digits and
the semantics expressed by the digits.
5. IMPACT
Findings from our study facilitate research
and development of tools to aid in name
recommendation and appraisal.
6. RESEARCH QUESTIONS
RQ 1: How do identifier renaming operations in the source code impact the existence of digits in an identifier's name?
• Volume and characteristics
• Digit preservation
RQ 2: How do developers utilize digits in an identifier's name to convey meaning?
• Qualitative examination of names
• Taxonomy for the presence of digits
10. RQ 1: The treatment of names with digits over time
Approach:
● Automated examination of the presence and/or absence of digits in an identifier’s
name before and after a rename operation
Findings:
● Digits are frequently preserved when an identifier is renamed (e.g., node2 → node3)
○ 43.56% of instances preserve digits
○ 33.29% of instances remove digits
○ 23.15% of instances add digits
● Digit preservation:
○ Most names contain only a single digit in the old & new name
■ 79.93% of instances
○ Equal number of digits in the old & new name
■ 91.35% of instances
○ The position of the digit is mostly preserved
■ 2nd position in name: 28.73% (e.g., shade2 → shade2Figure)
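The automated check described above could be sketched roughly as follows. This is a minimal illustration, not the study's actual tooling: the function names are hypothetical, and "position" is assumed here to mean the 1-based index of a digit token after splitting the name on camel-case and digit boundaries.

```python
import re

DIGIT = re.compile(r"\d")
# Assumed tokenization: uppercase runs, capitalized/lowercase words, digit runs.
TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def classify_rename(old_name: str, new_name: str) -> str:
    """Label a rename by what happens to digits across the rename."""
    old_has = bool(DIGIT.search(old_name))
    new_has = bool(DIGIT.search(new_name))
    if old_has and new_has:
        return "preserve"
    if old_has:
        return "remove"
    if new_has:
        return "add"
    return "none"


def digit_count(name: str) -> int:
    """Number of digit characters in the name."""
    return len(DIGIT.findall(name))


def digit_positions(name: str) -> list[int]:
    """1-based token positions of the digit tokens in the name."""
    tokens = TOKEN.findall(name)
    return [i for i, tok in enumerate(tokens, start=1) if tok.isdigit()]


print(classify_rename("node2", "node3"))                       # preserve
print(digit_count("node2") == digit_count("node3"))            # True
print(digit_positions("shade2"), digit_positions("shade2Figure"))  # [2] [2]
```

Comparing `digit_count` and `digit_positions` for the old and new name gives the "equal number of digits" and "position preserved" checks for a single rename.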
11. RQ 2: The meaning conveyed by digits in a name
Approach:
● Manual examination of 375 rename instances by the authors (a stratified, statistically significant sample)
○ Includes reviewing the surrounding code
○ Snowballing to locate examples of additional instances in the original dataset
Findings:
● Taxonomy of 6 categories showing how digits convey meaning in a name:
○ Auto-Generated
○ Distinguisher
○ Synonym
○ Version Number
○ Specification
○ Domain/Technology
12. RQ 2: The meaning conveyed by digits in a name
Auto-Generated
• Created by a code generation tool or IDE; not easily comprehensible
• Numbers may have a meaning
based on the generation technique
• E.g., LA18_6
Distinguisher
• Usually the last token in the name
• At least two identifiers having a
lexically identical name
• Avoid name collision at compilation
• E.g., auditLog3
Synonym
• At least one digit utilized in place of
a word
• The numbers 2 and 4 are very common examples
• E.g., convert2RList
Domain/Technology
• The digit is part of the name of a domain term or technology
• Digits themselves have no
individual meaning
• E.g., slf4jLogLevel
Specification
• Represents a specification
• Acts as a way to uniquely identify
concepts, behaviors, or
characteristics.
• E.g., arialRegular9Dark
Version Number
• Digit used to signify a version
number
• Indicates significant capabilities and limitations of the identifier
• E.g., V1DozerTransformModel
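Two of these categories have surface cues that a tool could look for. The sketch below is a hypothetical candidate generator, not the manual procedure used in the study: it assumes the word readings "to" and "for" for 2 and 4, and treats a trailing digit as a Distinguisher only when another identifier in scope shares the same name once trailing digits are dropped.

```python
import re

# Assumed tokenization: uppercase runs, capitalized/lowercase words, digit runs.
TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")
# Assumed word readings for the two digits the slides call out.
SYNONYM_DIGITS = {"2": "to", "4": "for"}


def tokens(name: str) -> list[str]:
    return TOKEN.findall(name)


def stem(name: str) -> str:
    """Name with any trailing digits removed."""
    return re.sub(r"\d+$", "", name)


def guess_category(name: str, siblings: tuple[str, ...] = ()) -> str:
    """Return a candidate category, or 'Unknown' when no cue applies."""
    parts = tokens(name)
    # Distinguisher: digit is the last token and a sibling shares the stem.
    if parts and parts[-1].isdigit():
        if any(s != name and stem(s) == stem(name) for s in siblings):
            return "Distinguisher"
    # Synonym: a 2 or 4 sitting between two word tokens.
    for i in range(1, len(parts) - 1):
        if (parts[i] in SYNONYM_DIGITS
                and parts[i - 1].isalpha() and parts[i + 1].isalpha()):
            return "Synonym"
    return "Unknown"


print(guess_category("auditLog3", ("auditLog1", "auditLog2")))  # Distinguisher
print(guess_category("convert2RList"))                          # Synonym
print(guess_category("arialRegular9Dark"))                      # Unknown
```

A name such as slf4jLogLevel would be misflagged as a Synonym by this rule, so its output is best read as a list of candidates for human review.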
14. KEY CHALLENGES
● Auto-generated code can skew findings
○ You will most likely need to run multiple iterations of your data collection/extraction process to isolate auto-generated identifiers
● The volume of auto-generated names can hinder data sampling activities, as they may make up the majority of identifiers in the code
○ This can vary depending on the type of auto-generated code the project utilizes
● Automatically detecting auto-generated identifiers is not straightforward
○ Heuristics can help, but only partly, and they depend on human judgment
● Numbers can have different interpretations; with limited research in this specific
area, we don't know if numbers hinder or help comprehension
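As an illustration of the kind of heuristic mentioned above, the sketch below flags an identifier when its file carries a common generated-code marker or when the name has a shape similar to LA18_6. Both the marker list and the name pattern are assumptions chosen for this example; they are not the filters used in the study and would need tuning per project.

```python
import re

# Assumed markers often found in headers or annotations of generated Java files.
GENERATED_MARKERS = ("@generated", "generated by", "do not edit")
# Assumed name shape: 1-3 letters, digits, then one or more _digits groups.
GENERATED_NAME = re.compile(r"[A-Za-z]{1,3}\d+(?:_\d+)+")


def looks_auto_generated(name: str, file_text: str = "") -> bool:
    """Heuristic guess only; results still need manual review."""
    lowered = file_text.lower()
    if any(marker in lowered for marker in GENERATED_MARKERS):
        return True
    return GENERATED_NAME.fullmatch(name) is not None


print(looks_auto_generated("LA18_6"))     # True
print(looks_auto_generated("auditLog3"))  # False
```

Running such a filter, inspecting what remains, and refining the patterns is one way the multiple extraction iterations could play out in practice.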
15. KEY TAKEAWAYS
01 Digits Are Preserved Post-Rename
02 The Digits Found in Identifier Names Are Meaningful
• Improve identifier name appraisals and recommendations when developers perform rename operations
• Utilize static analysis to determine if the digit is related to the code
• Build a catalog of technologies, standards, or domain terms in the project
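A project catalog of this kind could be used to separate digits that belong to a known term from digits that still need an explanation. The following is a minimal sketch under stated assumptions: the catalog entries other than slf4j are sample values added for illustration, and simple lowercase substring matching stands in for whatever matching a real tool would use.

```python
import re

# Hypothetical project catalog; only slf4j comes from the slides.
CATALOG = {"slf4j", "log4j", "utf8", "sha256"}


def catalog_terms(name: str, catalog: set[str] = CATALOG) -> list[str]:
    """Catalog terms that appear in the name (case-insensitive)."""
    lowered = name.lower()
    return sorted(term for term in catalog if term in lowered)


def unexplained_digits(name: str, catalog: set[str] = CATALOG) -> list[str]:
    """Digit runs left over after masking out catalog terms."""
    lowered = name.lower()
    # Mask longer terms first so overlapping entries do not leave fragments.
    for term in sorted(catalog, key=len, reverse=True):
        lowered = lowered.replace(term, " ")
    return re.findall(r"\d+", lowered)


print(catalog_terms("slf4jLogLevel"))               # ['slf4j']
print(unexplained_digits("slf4jLogLevel"))          # []
print(unexplained_digits("V1DozerTransformModel"))  # ['1']
```

Digits that remain after masking are the ones a naming tool would still need to account for, for example as a version number or distinguisher.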
16. IDENTIFIER NAMING STRUCTURE CATALOGUE
A resource about what is scientifically known about naming identifiers
• Part-of-Speech Tagset
• Linguistic Terminology
• Linguistic Antipatterns
• Common Naming Structures
• Naming Styles
Available at: https://www.scanl.org