1. On the Use of Domain Terms in Source Code
Sonia Haiduc
Andrian Marcus
ICPC 2008
Amsterdam, The Netherlands
2. Importance of Domain Terms in Program Comprehension
• Requirements documents, conversations
between stakeholders, etc. -> expressed using
domain terms
• Source code – representation of domain ->
domain concepts in the source code
• Using domain terms in source code - essential
for reuse -> intent of the software
• Concept/concern location, traceability, etc.
• XP -> system metaphor -> the use of the same
words to describe the same concepts is
desirable
3. What is the Code about?
void setAccount(…)
{
    int x;
    int y;
    String z;
    …
}

void getCarPart()
{
    int x;
    float y;
    String z;
    …
}
4. Lexical Agreement
• Furnas: 20% probability of two people choosing
the same word to express the same concept
• Summarization: 20% lexical agreement
between summaries of the same document
-> agreement this low would be a catastrophe for comprehending source code
5. Research Questions
RQ1. To what degree are domain terms found in
the source code of software from a particular
problem domain?
RQ2. Which is the preponderant source of
domain terms: identifiers or comments?
RQ3. To what extent do programmers agree in
choosing domain terms across systems from
the same problem domain?
6. Case Study
• 6 graph theory libraries (2 C++, 4 Java)
• 135 domain concepts, 193 domain terms
(http://www.cs.wayne.edu/~severe/icpc2008)
8. RQ1. Degree of Domain Terms in Source Code
• On average, 42% of the domain terms appear in the
software domain vocabulary of one library
• 77% of the domain terms were used in at least one of
the six libraries
• The size of the software domain vocabularies is
correlated with the size of the software systems and
software vocabularies
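The coverage figure behind RQ1 can be illustrated with a small sketch. This is not the authors' tooling: the function names `split_identifier` and `domain_coverage`, the identifier-splitting rules, and the sample data are all assumptions made for illustration. It also assumes single-word domain terms and exact matching (no stemming), which a real study would likely refine.

```python
import re

def split_identifier(text):
    """Split text on non-word characters, underscores and camelCase boundaries.

    Hypothetical helper: returns lowercase words, e.g. "addEdge" -> ["add", "edge"].
    """
    words = []
    for part in re.split(r"[_\W]+", text):
        # Acronym runs, capitalized or lowercase words, and digit runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]

def domain_coverage(domain_terms, identifiers, comments):
    """Return the domain terms found in the code vocabulary and their share."""
    vocabulary = set()
    for text in list(identifiers) + list(comments):
        vocabulary.update(split_identifier(text))
    found = {t for t in domain_terms if t.lower() in vocabulary}
    return found, len(found) / len(domain_terms)

# Made-up sample data, not taken from the case study.
domain_terms = ["graph", "vertex", "edge", "cycle", "clique"]
identifiers = ["addEdge", "Graph", "vertex_count"]
comments = ["Returns true if the graph contains a cycle."]

found, share = domain_coverage(domain_terms, identifiers, comments)
print(sorted(found), share)
```

In this toy input four of the five domain terms occur in the code vocabulary; "clique" does not, which is the kind of gap the 42% average reflects.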
9. RQ2. Domain Terms in Identifiers and Comments
• On average, 90% of the domain terms found in a
software library are found in comments, whereas only
78% are found in identifiers
-> comments richer source of domain terms and should
not be ignored
• 23% of domain terms found in a software library are
found only in comments, whereas only 11% are found
only in identifiers
-> comments complete the domain information when
missing from identifiers
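The four percentages above are set computations over the domain terms found in one library. A minimal sketch, assuming the terms found in identifiers and in comments have already been collected as two sets (the function name `source_breakdown` and the sample sets are hypothetical):

```python
def source_breakdown(in_identifiers, in_comments):
    """Share of a library's found domain terms per source (identifiers vs. comments)."""
    found = in_identifiers | in_comments
    if not found:
        raise ValueError("no domain terms found in this library")
    total = len(found)
    return {
        "in_comments": len(in_comments) / total,
        "in_identifiers": len(in_identifiers) / total,
        "only_comments": len(in_comments - in_identifiers) / total,
        "only_identifiers": len(in_identifiers - in_comments) / total,
    }

# Made-up sample sets, not taken from the case study.
in_identifiers = {"graph", "vertex", "edge"}
in_comments = {"graph", "edge", "cycle", "path"}

print(source_breakdown(in_identifiers, in_comments))
```

By construction, the share found only in comments equals one minus the share found in identifiers, and likewise for identifiers, so the reported 23% and 11% are consistent with the 78% and 90% figures up to rounding.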
10. RQ3. Lexical Agreement
• Software libraries – partial summaries of a
problem domain
• Pair-wise lexical agreement measure from
summarization
• Agreement of 63% between pairs of libraries
(compared to 24% in document summaries)
• 18 domain terms used in all libraries
• Up to 98% of domain terms reused between
pairs of libraries
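The slides do not give the formula of the pair-wise agreement measure, so the sketch below uses a stand-in: the directional reuse |A ∩ B| / |A|, and the mean of the two directions as a symmetric agreement score. Both definitions, the function names, and the sample vocabularies are assumptions for illustration, not the measure used in the study.

```python
from itertools import combinations

def reuse(a, b):
    """Fraction of the domain terms of library a that also occur in library b."""
    return len(a & b) / len(a)

def pairwise_agreement(a, b):
    """Assumed symmetric score: mean of the two directional reuse values."""
    return (reuse(a, b) + reuse(b, a)) / 2

def summarize(libraries):
    """Agreement per pair, mean agreement, and terms shared by all libraries."""
    pairs = {
        (x, y): pairwise_agreement(libraries[x], libraries[y])
        for x, y in combinations(sorted(libraries), 2)
    }
    mean = sum(pairs.values()) / len(pairs)
    common = set.intersection(*libraries.values())
    return pairs, mean, common

# Made-up domain vocabularies for three hypothetical libraries.
libraries = {
    "A": {"graph", "vertex", "edge", "path"},
    "B": {"graph", "edge"},
    "C": {"graph", "vertex", "edge", "cycle"},
}

pairs, mean, common = summarize(libraries)
print(pairs, mean, sorted(common))
```

A directional value such as `reuse(libraries["B"], libraries["A"])` corresponds to the "reused between pairs" reading, and `common` corresponds to the terms used in all libraries.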
11. Threats to Validity
• Only one domain considered
• List of domain terms and concepts manually
picked
• Verification of term meaning in the source code
• Considered the software libraries as partial
summaries of a domain
12. Future Work
• More case studies, different domains
• Verification of the meaning of terms in source
code - consider word relationships to determine
the meaning of words automatically
• Analyze other software artifacts (requirement
documents, user manuals, bug reports, etc.)
13. Conclusion
• 42% of domain terms are found in the source code
-> domain ontologies constructed from the source
code will be far from complete
• Comments are a richer source of domain terms than
identifiers and contain extra domain terms
-> comments should not be ignored by tools or by
programmers
• High lexical agreement between programmers
-> developers familiar with the domain will have an
easier time understanding source code in the same
domain, even when written by others.
-> domain-specific tools