1. Exploring the Influence of Identifier Names on Code Quality: An Empirical Study
Simon Butler, Michel Wermelinger, Yijun Yu and Helen Sharp
Centre for Research in Computing
The Open University, UK
CSMR, Madrid, 18 March 2010
Simon Butler et al. (Open Univ., UK) The Influence of Identifiers on Code Quality CSMR’10 1 / 13
2. Introduction
Identifier names
primary source of concepts in source code
crucial to program comprehension and readability
reflect cognitive processes
A wider influence?
connection between readability and defects (Buse & Weimer)
Research Question
‘What is the influence of identifier name quality
on source code quality?’
3. Evaluating Identifier Name Quality
Relf’s Identifier Naming Style Guidelines
21 guidelines for Ada & Java
evaluated empirically
focus on typography of names
simple approach to use of natural language
Applying the Guidelines
adapted 9 guidelines as naming flaw indicators
length: too few/many words/characters
typographical conventions: capitalization, type encoding
natural language: English and extended dictionaries
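A few of the flaw indicators can be sketched as simple predicates. The thresholds and the function name below are hypothetical: the slides do not give the values used in the study, and only three of the indicators are shown.

```python
# Hypothetical thresholds for illustration only; the slides give no values.
MAX_CHARS = 25
MIN_CHARS = 3


def naming_flaws(name):
    """Return a sorted list of illustrative flaw labels that apply to `name`."""
    flaws = set()
    # Leading or trailing underscore.
    if name.startswith("_") or name.endswith("_"):
        flaws.add("External Underscores")
    # Length checks, measured in characters.
    if len(name) > MAX_CHARS:
        flaws.add("Long Identifier")
    if len(name) < MIN_CHARS:
        flaws.add("Short Identifier Name")
    return sorted(flaws)


for candidate in ["_count", "i", "mouseEventMask"]:
    print(candidate, naming_flaws(candidate))
```

The labels reuse the row names from the results tables. A real checker would also need the word-count, capitalisation, type-encoding and dictionary rules.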
4. Evaluating Code Quality
Static analysis
FindBugs
Java specific static analysis tool
Identifies a range of priority 1 and 2 bug patterns
Evaluation at Google: most issues it identified required correction
Metrics
Readability
human-trained layout metric (Buse & Weimer)
Cyclomatic Complexity
to measure branching complexity
Maintainability Index
based on LOC, cyclomatic complexity, Halstead volume (Welker et al.)
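As a rough illustration of how the three inputs combine, the sketch below uses the commonly cited three-metric form of the Maintainability Index. The slides do not say which variant or coefficients the study used, so the formula and the function name are assumptions.

```python
import math


def maintainability_index(halstead_volume, cyclomatic, loc):
    """Commonly cited three-metric MI (assumed variant, not confirmed by the slides).

    MI = 171 - 5.2 * ln(V) - 0.23 * G - 16.2 * ln(LOC)
    """
    return (
        171.0
        - 5.2 * math.log(halstead_volume)
        - 0.23 * cyclomatic
        - 16.2 * math.log(loc)
    )


# Hypothetical method: Halstead volume 1000, cyclomatic complexity 5, 50 lines.
mi = maintainability_index(1000.0, 5, 50)
print(round(mi, 1))
```

Higher values mean more maintainable code: with this form, larger volume, more branching and more lines all lower the index.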
5. Methodology
Data Collection
8 mature FLOSS Java projects from different domains
each with 1-12 thousand methods
computed metrics and extracted names from source code
ran FindBugs on corresponding bytecode
6. Methodology
Naming Quality
Names split into hard words on typographical boundaries
NullPointerException is split into {Null, Pointer, Exception}
MOUSE_EVENT_MASK is split into {MOUSE, EVENT, MASK}
Extended dictionaries created with unrecognised hard words
built dictionaries for words used in 3, 5 or 10 unique identifiers
Identifier names analysed for compliance with each guideline
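The splitting and extended-dictionary steps above might look like the following sketch. The regular expression, the function names and the reading of "used in 3, 5 or 10 unique identifiers" as "at least that many" are assumptions, not the authors' implementation.

```python
import re
from collections import Counter

# Boundaries assumed here: underscores, case transitions, acronym ends, digit runs.
_HARD_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_hard_words(name):
    """Split an identifier into hard words on typographical boundaries."""
    return _HARD_WORD.findall(name)


def build_extended_dictionary(identifiers, dictionary, min_identifiers=3):
    """Collect unrecognised words that occur in at least `min_identifiers`
    unique identifier names (threshold semantics are an assumption)."""
    counts = Counter()
    for name in set(identifiers):
        # Count each word once per identifier, case-insensitively.
        for word in {w.lower() for w in split_hard_words(name)}:
            if word not in dictionary:
                counts[word] += 1
    return {word for word, n in counts.items() if n >= min_identifiers}


print(split_hard_words("NullPointerException"))
print(split_hard_words("MOUSE_EVENT_MASK"))
```

Calling `build_extended_dictionary` with `min_identifiers` set to 3, 5 and 10 would give three dictionaries of increasing strictness, matching the "Extended 3/5/10" rows in the results tables.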
Code Quality
binary classification of methods into
with/without FindBugs priority 1 (or 2) warnings
readability below/above 0.5
cyclomatic complexity below/above 6 (or 10)
maintainability index below/above 65
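The binary classification above can be expressed as a small function. The thresholds come from the slide; the type and field names are hypothetical, and the handling of values exactly on a threshold (other than cyclomatic complexity >= 10, which a later slide title states) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class MethodMetrics:
    """Per-method measurements (hypothetical field names)."""
    findbugs_warnings: int   # number of warnings at the chosen priority
    readability: float       # readability score in [0, 1]
    cyclomatic: int          # cyclomatic complexity
    maintainability: float   # maintainability index


def classify(m, cc_threshold=10):
    """Map raw metrics to the four binary quality classes."""
    return {
        "has_warning": m.findbugs_warnings > 0,
        "less_readable": m.readability < 0.5,
        "more_complex": m.cyclomatic >= cc_threshold,  # use 6 for the lower cut-off
        "less_maintainable": m.maintainability < 65,
    }


print(classify(MethodMetrics(findbugs_warnings=1, readability=0.3,
                             cyclomatic=12, maintainability=60.0)))
```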
7. Statistical Analysis
Null hypothesis: independent distributions
χ² test applied to assess independence of identifier flaws and:
FindBugs warnings
less readable methods
less maintainable methods
more complex methods
null hypothesis was rejected if p < 0.05
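A minimal sketch of such an independence test for one 2x2 table, using only the standard library. It applies the plain Pearson statistic without continuity correction, which is an assumption: the slides do not say which variant was used. The counts are the JFreeChart figures shown on this slide.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the table [[a, b], [c, d]].

    Returns (statistic, p_value) with one degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the statistic is a squared standard normal,
    # so the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Rows: methods with / without non-dictionary words.
# Columns: methods with / without FindBugs priority 2 warnings.
chi2, p = chi_square_2x2(103, 2925, 37, 5165)
print(chi2, p, "reject independence" if p < 0.05 else "cannot reject")
```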
Guidelines as classifiers?
Applied diagnostic test evaluation used in medicine
Compared each guideline vs reference classifiers
Example: JFreeChart, Non-Dictionary Words vs FindBugs Priority Two Warnings (method counts)
                                   with warnings   without warnings
methods with non-dictionary words        103             2925
methods without                           37             5165
sensitivity = 103 ÷ (103 + 37) = 0.74
specificity = 5165 ÷ (2925 + 5165) = 0.64
AUC = 0.69
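The figures on this slide can be reproduced with a few lines. The AUC formula used here, the mean of sensitivity and specificity (the area under a single-point ROC curve), is an assumption: it matches the reported 0.69 but is not stated on the slide. The function name is hypothetical.

```python
def diagnostic_summary(tp, fp, fn, tn):
    """Sensitivity, specificity and single-point AUC for a binary classifier."""
    sensitivity = tp / (tp + fn)   # flagged methods among those with warnings
    specificity = tn / (fp + tn)   # unflagged methods among those without warnings
    auc = (sensitivity + specificity) / 2  # assumed: area under a one-point ROC curve
    return sensitivity, specificity, auc


# JFreeChart, Non-Dictionary Words vs FindBugs priority 2 warnings.
sens, spec, auc = diagnostic_summary(tp=103, fp=2925, fn=37, tn=5165)
print(round(sens, 2), round(spec, 2), round(auc, 2))  # prints 0.74 0.64 0.69
```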
9. Identifier flaws and FindBugs priority 2 warnings
Projects: JasperReports, JFreeChart, Hibernate, Freemind, Tomcat, Cactus, jEdit, Ant
Capitalisation Anomaly .62 .62 – – .57
Excessive Words .55 .55 .58 –
External Underscores * * * *
Long Identifier .59 .57 –
Naming Convention Anomaly
Number of Words .56 .59 – .55 .55
Numeric Identifier * * * *
Short Identifier Name .56 .58 .62 – .56 .57
Type Encoding * * *
Non-Dictionary Words .60 .64 .62 – .63 .69 .59
Extended 3 .64 .66 .59 .63 .59
Extended 5 .64 .65 .64 – .63 .72 .59
Extended 10 .63 .64 .64 – .61 .72 .61
Less-readable .67 .67 .67 – .66 .68
Key: significance levels p < 0.001, p < 0.05, p >= 0.05; * = no flaw
10. Identifier flaws and Cyclomatic Complexity >= 10
Projects: JasperReports, JFreeChart, Hibernate, Freemind, Tomcat, Cactus, jEdit, Ant
Capitalisation Anomaly .67 .72 .63 .64 .66 .61 .73 .75
Excessive Words .55 .55 .58 .65 .58 .60
External Underscores * * * *
Long Identifier .56 .57 .68 .66 .58 .57
Naming Convention Anomaly .55
Number of Words .55 .61 .57 .60 .64 .58 .59
Numeric Identifier * * * *
Short Identifier Name .63 .65 .57 .62 .62 .55 .60 .62
Type Encoding * * *
Non-Dictionary Words .67 .70 .67 .74 .70 .64 .78 .76
Extended 3 .69 .70 .61 .73 .68 .64 .75 .75
Extended 5 .70 .69 .65 .75 .73 .66 .82 .76
Extended 10 .70 .70 .66 .76 .74 .66 .81 .77
Key: significance levels p < 0.001, p < 0.05, p >= 0.05; * = no flaw
11. Identifier flaws and Less-Readable methods
Projects: JasperReports, JFreeChart, Hibernate, Freemind, Tomcat, Cactus, jEdit, Ant
Capitalisation Anomaly .62 .55 .61 .60 .62 .62 .63 .66
Excessive Words .59 .58 .61 .57
External Underscores * * * *
Long Identifier .56 .58 .60 .58 .56 .56
Naming Convention Anomaly
Number of Words .56 .60 .55
Numeric Identifier * * * *
Short Identifier Name .57
Type Encoding * * *
Non-Dictionary Words .65 .56 .61 .66 .65 .65 .62 .68
Extended 3 .62 .56 .58 .62 .60 .65
Extended 5 .64 .57 .60 .63 .63 .66
Extended 10 .65 .56 .58 .63 .65 .63 .68
Key: significance levels p < 0.001, p < 0.05, p >= 0.05; * = no flaw
12. Identifier flaws and Less-Maintainable methods
Projects: JasperReports, JFreeChart, Hibernate, Freemind, Tomcat, Cactus, jEdit, Ant
Capitalisation Anomaly .78 .78 .76 .67 .67 .64 .81 .77
Excessive Words .59 .58 .67 .68 .62 .57 .63 .55
External Underscores * * * .57 *
Long Identifier .57 .68 .67 .73 .71 .57 .61 .58
Naming Convention Anomaly .55 .57 .56 .55
Number of Words .57 .61 .62 .62 .65 .56 .59 .60
Numeric Identifier * * * *
Short Identifier Name .59 .65 .62 .65 .66 .56 .61 .63
Type Encoding * * *
Non-Dictionary Words .76 .77 .79 .82 .72 .72 .80 .78
Extended 3 .81 .76 .69 .83 .72 .71 .84 .80
Extended 5 .82 .76 .75 .85 .78 .74 .85 .80
Extended 10 .80 .77 .77 .85 .80 .74 .84 .80
Key: significance levels p < 0.001, p < 0.05, p >= 0.05; * = no flaw
13. Conclusions
We found:
Poor quality identifier names are associated with:
more complex
less readable
less maintainable
potentially more buggy code
Natural language content of identifier names is a classifier for source
code quality
Identifier name length is a classifier for complexity and maintainability
Opposite associations only in commercialised projects suggesting
differences between open source and commercial code