On the Use of Domain Terms in Source Code

Sonia Haiduc
Andrian Marcus

ICPC 2008
Amsterdam, The Netherlands
Importance of Domain Terms in Program Comprehension


• Requirements documents, conversation
  between stakeholders, etc. -> expressed using
  domain terms
• Source code – representation of domain ->
  domain concepts in the source code
• Using domain terms in source code - essential
  for reuse -> intent of the software
• Concept/concern location, traceability, etc.
• XP -> system metaphor -> the use of the same
  words to describe the same concepts is
  desirable
What is the Code about?

void setAccount(…)   void getCarPart()
{                    {
  int x;               int x;
  int y;               float y;
  String z;            String z;
  …                    …
}                    }
Lexical Agreement


• Furnas: 20% probability of two people choosing
  the same word to express the same concept

• Summarization: 20% lexical agreement
  between summaries of the same document

-> catastrophe for comprehending source code
Research Questions

RQ1. To what degree are domain terms found in
 the source code of software from a particular
 problem domain?

RQ2. Which is the preponderant source of
 domain terms: identifiers or comments?

RQ3. To what level do programmers agree in
 choosing domain terms across systems from
 the same problem domain?
Case Study

• 6 graph theory libraries (2 C++, 4 Java)




• 135 domain concepts, 193 domain terms
  (http://www.cs.wayne.edu/~severe/icpc2008)
Lexica


• Domain vocabulary
• Software vocabulary
• Software domain vocabulary

• Filtering and stemming
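
A minimal sketch of how the three lexica could relate to each other. It assumes the software vocabulary is built by splitting identifiers and comment text into lowercase words, filtering out stop words, and stemming, and that the software domain vocabulary is the intersection of the domain vocabulary with the software vocabulary. The slides do not name the filter list or the stemmer that was used, so the class `LexiconSketch`, its tiny stop list, and its naive `stem` method are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

final class LexiconSketch {
    // Hypothetical stop list; the study's real filter is not given in the slides.
    private static final Set<String> STOP_WORDS =
            Set.of("the", "a", "of", "to", "get", "set", "is");

    // Split raw text (an identifier or a comment) into lowercase words,
    // breaking on camelCase boundaries and on any non-letter character.
    static List<String> split(String text) {
        List<String> words = new ArrayList<>();
        String spaced = text.replaceAll("([a-z])([A-Z])", "$1 $2");
        for (String w : spaced.split("[^A-Za-z]+")) {
            if (!w.isEmpty()) {
                words.add(w.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    // Naive stand-in for a real stemmer: strips a single trailing "s".
    static String stem(String word) {
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Software vocabulary: filtered, stemmed words taken from the source text.
    static Set<String> softwareVocabulary(List<String> sourceTexts) {
        Set<String> vocabulary = new TreeSet<>();
        for (String text : sourceTexts) {
            for (String w : split(text)) {
                if (!STOP_WORDS.contains(w)) {
                    vocabulary.add(stem(w));
                }
            }
        }
        return vocabulary;
    }

    // Software domain vocabulary: domain terms that also occur in the software vocabulary.
    static Set<String> softwareDomainVocabulary(Set<String> domainVocabulary,
                                                Set<String> softwareVocabulary) {
        Set<String> result = new TreeSet<>(domainVocabulary);
        result.retainAll(softwareVocabulary);
        return result;
    }
}
```

The domain vocabulary passed to `softwareDomainVocabulary` is expected to be stemmed the same way, so that both sides of the intersection use the same word forms.
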
RQ1. Degree of Domain Terms in Source Code

• On average, 42% of the domain terms appear in the
  software domain vocabulary of one library

• 77% of the domain terms were used in at least one of
  the six libraries

• The size of the software domain vocabularies is
  correlated with the size of the software systems and
  software vocabularies
RQ2. Domain Terms in Identifiers and Comments

• On average, 90% of the domain terms found in a
  software library are found in comments, whereas only
  78% are found in identifiers
  -> comments richer source of domain terms and should
  not be ignored

• 23% of domain terms found in a software library are
  found only in comments, whereas only 11% are found
  only in identifiers
  -> comments complete the domain information when
  missing from identifiers
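
The four percentages above can be read as set operations on two term sets per library: the domain terms found in comments and the domain terms found in identifiers. The sketch below shows one way to compute them. The class `TermSourceSketch`, its method names, and the toy data in `main` are hypothetical and are not taken from the study.

```java
import java.util.HashSet;
import java.util.Set;

final class TermSourceSketch {
    // Percentage of the terms in `all` that also occur in `part` (0 when `all` is empty).
    static double percentOf(Set<String> part, Set<String> all) {
        if (all.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(part);
        common.retainAll(all);
        return 100.0 * common.size() / all.size();
    }

    // Terms of `a` that do not occur in `b`.
    static Set<String> minus(Set<String> a, Set<String> b) {
        Set<String> difference = new HashSet<>(a);
        difference.removeAll(b);
        return difference;
    }

    public static void main(String[] args) {
        // Hypothetical toy data for a single library.
        Set<String> inComments = Set.of("graph", "edge", "vertex", "path");
        Set<String> inIdentifiers = Set.of("graph", "edge", "node");

        // All domain terms found in the library, from either source.
        Set<String> found = new HashSet<>(inComments);
        found.addAll(inIdentifiers);

        System.out.println("in comments:         " + percentOf(inComments, found));
        System.out.println("in identifiers:      " + percentOf(inIdentifiers, found));
        System.out.println("only in comments:    " + percentOf(minus(inComments, inIdentifiers), found));
        System.out.println("only in identifiers: " + percentOf(minus(inIdentifiers, inComments), found));
    }
}
```
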
RQ3. Lexical Agreement


• Software libraries – partial summaries of a
  problem domain
• Pair-wise lexical agreement measure from
  summarization
• Agreement of 63% between pairs of libraries
  (compared to 24% in document summaries)
• 18 domain terms used in all libraries
• Up to 98% of domain terms reused between
  pairs of libraries
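
The slides do not spell out the pair-wise agreement measure, so the sketch below should be read as an illustration only. It assumes the Dice coefficient as a stand-in for symmetric agreement and a simple asymmetric ratio for term reuse. The class `AgreementSketch` and all of its method names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class AgreementSketch {
    // Number of terms that two vocabularies have in common.
    static int shared(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    // ASSUMPTION: Dice coefficient as a stand-in for the study's agreement measure.
    static double agreement(Set<String> a, Set<String> b) {
        int total = a.size() + b.size();
        return total == 0 ? 0.0 : 2.0 * shared(a, b) / total;
    }

    // Fraction of a's terms that also occur in b (asymmetric by design).
    static double reuse(Set<String> a, Set<String> b) {
        return a.isEmpty() ? 0.0 : (double) shared(a, b) / a.size();
    }

    // Terms that occur in every library.
    static Set<String> usedInAll(List<Set<String>> libraries) {
        if (libraries.isEmpty()) {
            return new HashSet<>();
        }
        Set<String> common = new HashSet<>(libraries.get(0));
        for (Set<String> library : libraries) {
            common.retainAll(library);
        }
        return common;
    }

    // Mean agreement over all unordered pairs of libraries.
    static double meanPairwiseAgreement(List<Set<String>> libraries) {
        double sum = 0.0;
        int pairs = 0;
        for (int i = 0; i < libraries.size(); i++) {
            for (int j = i + 1; j < libraries.size(); j++) {
                sum += agreement(libraries.get(i), libraries.get(j));
                pairs++;
            }
        }
        return pairs == 0 ? 0.0 : sum / pairs;
    }
}
```

With an asymmetric `reuse`, a small library whose terms nearly all appear in a larger one scores close to 1 in one direction and much lower in the other, which is one plausible reading of how reuse between a pair can be very high while average agreement stays moderate.
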
Threats to Validity


• Only one domain considered
• List of domain terms and concepts manually
  picked
• Verification of term meaning in the source code
• Considered the software libraries as partial
  summaries of a domain
Future Work


• More case studies, different domains
• Verification of the meaning of terms in source
  code - consider word relationships to determine
  the meaning of words automatically
• Analyze other software artifacts (requirements
  documents, user manuals, bug reports, etc.)
Conclusion

• On average, 42% of domain terms are found in the source code of a library
     -> domain ontologies constructed from the source
     code will be far from complete

• Comments are a richer source of domain terms than
  identifiers and contain extra domain terms
      -> comments should not be ignored by tools or by
      programmers

• High lexical agreement between programmers
      -> developers familiar with the domain will have an
      easier time understanding source code in the same
      domain, even when written by others.
      -> domain-specific tools
