SlideShare a Scribd company logo
1 of 12
Download to read offline
The Comment Density
of Open Source Software Code
            Oliver Arafat                                          Dirk Riehle
 Siemens AG, Corporate Technology                         SAP Research, SAP Labs LLC
        oarafat@gmail.com                              dirk@riehle.org, http://dirkriehle.com
                                                           http://twitter.com/dirkriehle 



 Oliver Arafat, Dirk Riehle. “The Comment Density of Open Source Software 
Code.” In Companion to Proceedings of the 31st International Conference 
 on Software Engineering (ICSE 2009). IEEE Press, 2009. Page 195­198.
   [Link: http://dirkriehle.com/2009/02/04/the­comment­density­of­open­source­software­code/]
Overview
Summary (blog post): The Sweet Spot of Code Commenting in Open Source
[Link: http://dirkriehle.com/2009/02/04/the­sweet­spot­of­code­commenting­in­open­source/]


Abstract: The development processes of open source software are different from 
traditional closed source development processes. Still, open source software is 
frequently of high quality. Thus, we are investigating how open source software 
creates high quality and whether it can maintain this quality for ever larger project 
sizes. In this paper, we look at one particular quality indicator, the density of comments 
in open source software code. In a large­scale study of more than 5,000 projects, we 
find that active open source projects document their source code, and we find that the 
comment density is independent of team and project size, but not of project age. In 
future work, we intend to correlate comment density with project success or failure. 

Reference: Oliver Arafat, Dirk Riehle. 
“The Comment Density of Open Source Software Code.” In Companion to 
Proceedings of the 31st International Conference on Software Engineering (ICSE 
2009). IEEE Press, 2009. Page 195­198. [Link: 
http://dirkriehle.com/2009/02/04/the­comment­density­of­open­source­software­code/]

                                                                                      April 6, 2009. Version 1.1. 
                                                                         © 2009 Dirk Riehle. All Rights Reserved.    2
Analysis Process and Tool Chain
                         Raw data source
    Raw Data             • Local database  (ohloh.net snapshot , crawled sources )
   (ohloh.net)           • Web services access  (ohloh.net, sourceforge .net, others)


                         Pre­processing
   SQL, Java             • Database querying using SQL and scripts
                         • Java library for computationally heavyweight filters , aggregation


                         Aggregated data source
   Aggr. Data            • Output of pre ­processing stage for specific analytical tasks
   (RDBMS)               • Aggregated data significantly improves analysis speed


  Excel / Calc,          Analytical processing
  R, spec. tools         • Mines aggregated  (and raw ) data for insights , hypothesis testing
                         • At present basic processing  (Excel), machine learning next


                         Analysis output
                         • Results of analytical processing : averages , distributions , correlations
             3
                 4
                     5
                         • Presented as models , tables, graphs, charts, etc.
                                                                                                                   3
         2
     1
                                                                                    April 6, 2009. Version 1.1. 
                                                                       © 2009 Dirk Riehle. All Rights Reserved.
Commenting Practice in Open Source
                1.E+08

                         y = 0.1596x
                1.E+07
                         R2 = 0.872

                1.E+06
Comment Lines




                1.E+05

                1.E+04

                1.E+03

                1.E+02

                1.E+01

                1.E+00
                    1.E+00    1.E+01   1.E+02   1.E+03   1.E+04   1.E+05   1.E+06         1.E+07           1.E+08        1.E+09

                                         Project Size in Lines of Code (LoC = CL + SLoC)



                                                                                          April 6, 2009. Version 1.1. 
                                                                             © 2009 Dirk Riehle. All Rights Reserved.       4
Comment Density
                  100%
                            mean = 0.1867
                   90%      median = 0.1674
                            stdev = 0.1088
                   80%
                            correl = ­0.00787
                   70%
Comment Density




                   60%

                   50%

                   40%

                   30%

                   20%

                   10%

                    0%
                     1.E+00      1.E+01         1.E+02   1.E+03     1.E+04    1.E+05       1.E+06         1.E+07          1.E+08           1.E+09

                                                  Project Size in Lines of Code (LoC = CL + SLoC)
• Comment density                                                        • Definitions
                  – Percentage of lines of code that are comments            – CL = comment lines
                  – Measure of how well documented code is                   – SLoC = source lines of code
                                                                             – CD = CL / (CL + SLoC)
                  – Indicator of likelihood of survival
                                                                                                            April 6, 2009. Version 1.1. 
                                                                                               © 2009 Dirk Riehle. All Rights Reserved.         5
Variation by Programming Language
                                                                                                                                                                                  Java
                1.E+08                                                                                                                   100%                                                                                                                                                     70%

                             y = 0.2798x                                                                                                  90%        mean = 0.2587
                1.E+07                                                                                                                               median = 0.2566                                                                                                                              60%
                             R2 = 0.9334




                                                                                                                                                                                                                                                                      Percentage of Occurrences
                                                                                                                                          80%        stdev = 0.1111
                1.E+06
                                                                                                                                          70%                                                                                                                                                     50%




                                                                                                                       Comment Density
Comment Lines




                1.E+05
                                                                                                                                          60%
                                                                                                                                                                                                                                                                                                  40%
                                                                                                                                                                                                                                                                                                                                            0.3327
                1.E+04                                                                                                                    50%
                                                                                                                                                                                                                                                                                                  30%
                                                                                                                                          40%                                                                                                                                                                               0.2450                          0.2340
                1.E+03
                                                                                                                                          30%                                                                                                                                                     20%
                1.E+02
                                                                                                                                          20%
                                                                                                                                                                                                                                                                                                            0.0841                                                          0.0841
                                                                                                                                                                                                                                                                                                  10%
                1.E+01                                                                                                                    10%
                                                                                                                                                                                                                                                                                                                                                                                            0.0165        0.0037         0.0000         0.0000
                1.E+00                                                                                                                     0%                                                                                                                                                     0%
                    1.E+00        1.E+01   1.E+02    1.E+03    1.E+04     1.E+05    1.E+06   1.E+07   1.E+08                                1.E+00        1.E+01        1.E+02     1.E+03    1.E+04     1.E+05    1.E+06    1.E+07    1.E+08                                                               [0.0, 0.1[      [0.1, 0.2[      [0.2, 0.3[      [0.3, 0.4[      [0.4, 0.5[     [0.5, 0.6[     [0.6, 0.7[     [0.7, 0.8[     [0.8, 0.9[

                             y = 0.2798x; R2 = 0.9334; mean = 0.2587; median = 0.2566; stdev = 0.1111; population size = 1085
                                              Project Size in Lines of Code (CL + SLoC)                                                                                     Project Size in Lines of Code (CL + SLoC)                                                                                                                                             Comment Density




                                                                                                                                                                                 PERL
                1.E+07                                                                                                           100%                                                                                                                                          70%

                             y = 0.1602x                                                                                                 90%       mean = 0.1044                                                                                                                                         0.5855
                1.E+06                                                                                                                             median = 0.0902                                                                                                             60%
                             R2 = 0.9464




                                                                                                                                                                                                                                               Percentage of Occurrences
                                                                                                                                         80%       stdev = 0.0713
                1.E+05                                                                                                                   70%                                                                                                                                   50%
                                                                                                               Comment Density
Comment Lines




                                                                                                                                         60%
                1.E+04                                                                                                                                                                                                                                                         40%
                                                                                                                                                                                                                                                                                                                         0.3382
                                                                                                                                         50%
                1.E+03                                                                                                                                                                                                                                                         30%
                                                                                                                                         40%

                1.E+02                                                                                                                   30%                                                                                                                                   20%
                                                                                                                                         20%
                1.E+01
                               y = 0.1602x; R2 = 0.9464; mean = 0.1044; median = 0.0902; stdev = 0.0713; population size = 273           10%
                                                                                                                                                                                                                                                                               10%                                                       0.0582
                                                                                                                                                                                                                                                                                                                                                         0.0073          0.0109          0.0000         0.0000         0.0000         0.0000
                1.E+00                                                                                                                   0%                                                                                                                                            0%
                    1.E+00        1.E+01   1.E+02    1.E+03    1.E+04    1.E+05     1.E+06   1.E+07   1.E+08                              1.E+00        1.E+01         1.E+02    1.E+03     1.E+04    1.E+05     1.E+06    1.E+07    1.E+08                                                             [0.0, 0.1[      [0.1, 0.2[      [0.2, 0.3[      [0.3, 0.4[      [0.4, 0.5[      [0.5, 0.6[     [0.6, 0.7[     [0.7, 0.8[     [0.8, 0.9[
                                              Project Size in Lines of Code (CL + SLoC)                                                                                   Project Size in Lines of Code (CL + SLoC)                                                                                                                                           Comment Density




                                           More here: How Open Source Comments (by Programming Language)
                                    [Link: http://dirkriehle.com/2008/11/10/how­open­source­comments­by­programming­language/]
                                                                                                             April 6, 2009. Version 1.1. 
                                                                                                                                                                                                                                                             © 2009 Dirk Riehle. All Rights Reserved.                                                                                                                                     6
Comment Density by Commit Size

                  100%
                             sloc size = 1­100    sloc size = 50­100    sloc size = 80­100           Comment Density by SLoC Size
                  90%        mean = 0.2513        mean = 0.2234         mean = 0.2224                Total Average Comment Density
                             median = 0.2340      median = 0.2190       median = 0.2171
                  80%
                             stdev = 0.0626       stdev = 0.0169        stdev = 0.0209
                  70%
Comment Density




                  60%

                  50%

                  40%

                  30%

                  20%

                  10%

                   0%
                         0        10         20          30        40          50            60      70             80             90         100

                                                      Source Lines of Code in Commit [SLoC]


                                                                                                               April 6, 2009. Version 1.1. 
                                                                                                  © 2009 Dirk Riehle. All Rights Reserved.      7
Comment Density by Team Size

                  50%                                                                                                                      50%
                            team size = 1­20    team size = 1­50   team size = 1­100        Comment Density by Team Size
                  45%       mean = 0.1914       mean = 0.1922      mean = 0.1856                                                           45%
                                                                                            Total Average Comment Density
                            median = 0.1878     median = 0.1906    median = 0.1857
                  40%                                                                       Percent Projects with Team Size                40%
                            stdev = 0.0255      stdev = 0.0425     stdev = 0.0641
                                                                   correl = ­0.0550
                  35%                                                                                                                      35%
Comment Density




                  30%                                                                                                                      30%

                  25%                                                                                                                      25%

                  20%                                                                                                                      20%

                  15%                                                                                                                      15%

                  10%                                                                                                                      10%

                  5%                                                                                                                       5%

                  0%                                                                                                                        0%
                        0         10           20      30          40       50         60    70            80             90             100

                                                         Team Size [Number of Committers]



                                                                                                          April 6, 2009. Version 1.1. 
                                                                                             © 2009 Dirk Riehle. All Rights Reserved.        8
Comment Density by Project Age

                          19.50%
                                       correl = ­0.9054
                          19.25%
Average Comment Density




                          19.00%

                          18.75%


                          18.50%

                          18.25%

                          18.00%

                          17.75%

                          17.50%
                                   0      3       6       9   12   15   18   21   24   27   30       33       36       39        42          45   48

                                                                        Project Age [months]


                                                                                                              April 6, 2009. Version 1.1. 
                                                                                                 © 2009 Dirk Riehle. All Rights Reserved.         9
Commenting Practice Summary
• Successful open source projects follow a consistent commenting practice
   – 1 out of 20 commits serves commenting purposes only
   – 1 out of 5 non­empty lines is a comment line


• Average comment density is independent of project size

• Average comment density is independent of team size

• Average comment density slowly falls with project age




                                                                        April 6, 2009. Version 1.1. 
                                                           © 2009 Dirk Riehle. All Rights Reserved.    10
References
• Dirk Riehle, John Ellenberger, Tamir Menahem, Boris Mikhailovski, Yuri Natchetoi, Barak Naveh, Thomas 
  Odenwald. “Bringing Open Source Best Practices into Corporations Using a Software Forge.” IEEE 
  Software, 2009. See: 
  http://dirkriehle.com/2009/02/11/open­collaboration­within­corporations­using­software­forges/ 
• Dirk Riehle. “The Economic Motivation of Open Source: Stakeholder Perspectives.” IEEE Computer, vol. 
  40, no. 4 (April 2007). Page 25­32. See: 
  http://dirkriehle.com/computer­science/research/2007/computer­2007.html 
• Oliver Arafat, Dirk Riehle. “The Commit Size Distribution of Open Source Software.” In Proceedings of 
  the 42nd Hawaiian International Conference on System Sciences (HICSS 42). IEEE Press, 2009. See: 
  http://dirkriehle.com/2008/09/23/the­commit­size­distribution­of­open­source­software/ 
• Amit Deshpande, Dirk Riehle. “The Total Growth of Open Source.” In Proceedings of the Fourth 
  Conference on Open Source Systems (OSS 2008). Springer Verlag, 2008. Page 197­209. See: 
  http://dirkriehle.com/2008/03/14/the­total­growth­of­open­source/ 
• Amit Deshpande, Dirk Riehle. “Continuous Integration in Open Source Software Development.” In 
  Proceedings of the Fourth Conference on Open Source Systems (OSS 2008). Springer Verlag, 2008. 
  Page 273­280. See: 
  http://dirkriehle.com/2008/03/08/continuous­integration­in­open­source­software­development/ 
• Philipp Hofmann, Dirk Riehle. “Estimating Commit Sizes Efficiently.” In Proceedings of the 5th 
  International Conference on Open Source Systems (OSS 2009). Springer Verlag, 2009. Page 105­115. 
  See: http://dirkriehle.com/2009/02/11/estimating­commit­sizes­efficiently/ 
• Oliver Arafat, Dirk Riehle. “The Comment Density of Open Source Software Code.” In Companion to 
  Proceedings of the 31st International Conference on Software Engineering (ICSE 2009). IEEE Press, 
  2009. Page 195­198. See: 
  http://dirkriehle.com/2009/02/04/the­comment­density­of­open­source­software­code/                           11
                                                                                        April 6, 2009. Version 1.1. 
                                                                           © 2009 Dirk Riehle. All Rights Reserved.
Thank you! Questions?
      For any feedback or questions, please email the authors:


         Oliver Arafat                            Dirk Riehle
Siemens AG, Corporate Technology          SAP Research, SAP Labs LLC
        oarafat@gmail.com              dirk@riehle.org, http://dirkriehle.com
                                           http://twitter.com/dirkriehle 

More Related Content

Similar to The Comment Density of Open Source Software Code

Managing a Website Performance Optimization (WPO) Project
Managing a Website Performance Optimization (WPO) ProjectManaging a Website Performance Optimization (WPO) Project
Managing a Website Performance Optimization (WPO) ProjectYottaa
 
Agile Workshop: Agile Metrics
Agile Workshop: Agile MetricsAgile Workshop: Agile Metrics
Agile Workshop: Agile MetricsSiddhi
 
Machine learning overview (with SAS software)
Machine learning overview (with SAS software)Machine learning overview (with SAS software)
Machine learning overview (with SAS software)Longhow Lam
 
The business case for automated software engineering
The business case for automated software engineering The business case for automated software engineering
The business case for automated software engineering CS, NcState
 
System Monitoring With Nagios PowerPoint Presentation Slides
System Monitoring With Nagios PowerPoint Presentation SlidesSystem Monitoring With Nagios PowerPoint Presentation Slides
System Monitoring With Nagios PowerPoint Presentation SlidesSlideTeam
 
My History with Atlassian Tools, and Why I'm Moving to Studio
My History with Atlassian Tools, and Why I'm Moving to StudioMy History with Atlassian Tools, and Why I'm Moving to Studio
My History with Atlassian Tools, and Why I'm Moving to StudioAtlassian
 
Win Friends and Influence People... with DSLs
Win Friends and Influence People... with DSLsWin Friends and Influence People... with DSLs
Win Friends and Influence People... with DSLsVladimir Bacvanski, PhD
 
Predicting Real Estate Prices with an ANN
Predicting Real Estate Prices with an ANNPredicting Real Estate Prices with an ANN
Predicting Real Estate Prices with an ANNChris Armstrong
 
Evaluating Data Freshness in Large Scale Replicated Databases
Evaluating Data Freshness in Large Scale Replicated DatabasesEvaluating Data Freshness in Large Scale Replicated Databases
Evaluating Data Freshness in Large Scale Replicated DatabasesMiguel Araújo
 
SRE - drupal day aveiro 2016
SRE - drupal day aveiro 2016SRE - drupal day aveiro 2016
SRE - drupal day aveiro 2016Ricardo Amaro
 
What Every Client Should Do on Their Oracle SOA Projects
What Every Client Should Do on Their Oracle SOA ProjectsWhat Every Client Should Do on Their Oracle SOA Projects
What Every Client Should Do on Their Oracle SOA ProjectsRevelation Technologies
 
Auto sre with keptn
Auto sre with keptnAuto sre with keptn
Auto sre with keptnLibbySchulze
 
Virtual Meetup: Mule 4 Error Handling and Logging
Virtual Meetup: Mule 4 Error Handling and LoggingVirtual Meetup: Mule 4 Error Handling and Logging
Virtual Meetup: Mule 4 Error Handling and LoggingJimmy Attia
 
How to Make Designer-Friendly Template Engine
How to Make Designer-Friendly Template EngineHow to Make Designer-Friendly Template Engine
How to Make Designer-Friendly Template Enginekwatch
 
WikiSym 2011 Closing Keynote
WikiSym 2011 Closing KeynoteWikiSym 2011 Closing Keynote
WikiSym 2011 Closing KeynoteEd Chi
 
Indirect Rates - What Strategy is Right for your Organization
Indirect Rates - What Strategy is Right for your OrganizationIndirect Rates - What Strategy is Right for your Organization
Indirect Rates - What Strategy is Right for your OrganizationUnanet
 

Similar to The Comment Density of Open Source Software Code (20)

Managing a Website Performance Optimization (WPO) Project
Managing a Website Performance Optimization (WPO) ProjectManaging a Website Performance Optimization (WPO) Project
Managing a Website Performance Optimization (WPO) Project
 
Agile Workshop: Agile Metrics
Agile Workshop: Agile MetricsAgile Workshop: Agile Metrics
Agile Workshop: Agile Metrics
 
Machine learning overview (with SAS software)
Machine learning overview (with SAS software)Machine learning overview (with SAS software)
Machine learning overview (with SAS software)
 
The business case for automated software engineering
The business case for automated software engineering The business case for automated software engineering
The business case for automated software engineering
 
Self-Adaptation of Online Recommender Systems via Feed-Forward Controllers
Self-Adaptation of Online Recommender Systems via Feed-Forward ControllersSelf-Adaptation of Online Recommender Systems via Feed-Forward Controllers
Self-Adaptation of Online Recommender Systems via Feed-Forward Controllers
 
Java script
Java scriptJava script
Java script
 
System Monitoring With Nagios PowerPoint Presentation Slides
System Monitoring With Nagios PowerPoint Presentation SlidesSystem Monitoring With Nagios PowerPoint Presentation Slides
System Monitoring With Nagios PowerPoint Presentation Slides
 
My History with Atlassian Tools, and Why I'm Moving to Studio
My History with Atlassian Tools, and Why I'm Moving to StudioMy History with Atlassian Tools, and Why I'm Moving to Studio
My History with Atlassian Tools, and Why I'm Moving to Studio
 
Simulation
SimulationSimulation
Simulation
 
Win Friends and Influence People... with DSLs
Win Friends and Influence People... with DSLsWin Friends and Influence People... with DSLs
Win Friends and Influence People... with DSLs
 
Predicting Real Estate Prices with an ANN
Predicting Real Estate Prices with an ANNPredicting Real Estate Prices with an ANN
Predicting Real Estate Prices with an ANN
 
Evaluating Data Freshness in Large Scale Replicated Databases
Evaluating Data Freshness in Large Scale Replicated DatabasesEvaluating Data Freshness in Large Scale Replicated Databases
Evaluating Data Freshness in Large Scale Replicated Databases
 
SRE - drupal day aveiro 2016
SRE - drupal day aveiro 2016SRE - drupal day aveiro 2016
SRE - drupal day aveiro 2016
 
What Every Client Should Do on Their Oracle SOA Projects
What Every Client Should Do on Their Oracle SOA ProjectsWhat Every Client Should Do on Their Oracle SOA Projects
What Every Client Should Do on Their Oracle SOA Projects
 
Auto sre with keptn
Auto sre with keptnAuto sre with keptn
Auto sre with keptn
 
Virtual Meetup: Mule 4 Error Handling and Logging
Virtual Meetup: Mule 4 Error Handling and LoggingVirtual Meetup: Mule 4 Error Handling and Logging
Virtual Meetup: Mule 4 Error Handling and Logging
 
How to Make Designer-Friendly Template Engine
How to Make Designer-Friendly Template EngineHow to Make Designer-Friendly Template Engine
How to Make Designer-Friendly Template Engine
 
WikiSym 2011 Closing Keynote
WikiSym 2011 Closing KeynoteWikiSym 2011 Closing Keynote
WikiSym 2011 Closing Keynote
 
Indirect Rates - What Strategy is Right for your Organization
Indirect Rates - What Strategy is Right for your OrganizationIndirect Rates - What Strategy is Right for your Organization
Indirect Rates - What Strategy is Right for your Organization
 
Focus on the Sustainable Market Economy
Focus on the Sustainable Market EconomyFocus on the Sustainable Market Economy
Focus on the Sustainable Market Economy
 

More from Dirk Riehle

Single-Vendor Open Source at the Crossroads
Single-Vendor Open Source at the CrossroadsSingle-Vendor Open Source at the Crossroads
Single-Vendor Open Source at the CrossroadsDirk Riehle
 
Why open source is good for your economy
Why open source is good for your economyWhy open source is good for your economy
Why open source is good for your economyDirk Riehle
 
Startupinformatik
StartupinformatikStartupinformatik
StartupinformatikDirk Riehle
 
The Business of Open Source User Foundations
The Business of Open Source User FoundationsThe Business of Open Source User Foundations
The Business of Open Source User FoundationsDirk Riehle
 
The Business of Open Models
The Business of Open ModelsThe Business of Open Models
The Business of Open ModelsDirk Riehle
 
2010 06-10 - linux-tag - dirk riehle - developer career - web
2010 06-10 - linux-tag - dirk riehle - developer career - web2010 06-10 - linux-tag - dirk riehle - developer career - web
2010 06-10 - linux-tag - dirk riehle - developer career - webDirk Riehle
 
Open Source: A New Developer Career
Open Source: A New Developer CareerOpen Source: A New Developer Career
Open Source: A New Developer CareerDirk Riehle
 
Micro-Blogging in the Enterprise Focus Groups Evaluation
Micro-Blogging in the Enterprise Focus Groups EvaluationMicro-Blogging in the Enterprise Focus Groups Evaluation
Micro-Blogging in the Enterprise Focus Groups EvaluationDirk Riehle
 
Six Easy Pieces of Quantitatively Analyzing Open Source
Six Easy Pieces of Quantitatively Analyzing Open SourceSix Easy Pieces of Quantitatively Analyzing Open Source
Six Easy Pieces of Quantitatively Analyzing Open SourceDirk Riehle
 
Learning From Wikipedia
Learning From WikipediaLearning From Wikipedia
Learning From WikipediaDirk Riehle
 
Open Collaboration
Open CollaborationOpen Collaboration
Open CollaborationDirk Riehle
 

More from Dirk Riehle (12)

Single-Vendor Open Source at the Crossroads
Single-Vendor Open Source at the CrossroadsSingle-Vendor Open Source at the Crossroads
Single-Vendor Open Source at the Crossroads
 
Why open source is good for your economy
Why open source is good for your economyWhy open source is good for your economy
Why open source is good for your economy
 
Startupinformatik
StartupinformatikStartupinformatik
Startupinformatik
 
Tripod
TripodTripod
Tripod
 
The Business of Open Source User Foundations
The Business of Open Source User FoundationsThe Business of Open Source User Foundations
The Business of Open Source User Foundations
 
The Business of Open Models
The Business of Open ModelsThe Business of Open Models
The Business of Open Models
 
2010 06-10 - linux-tag - dirk riehle - developer career - web
2010 06-10 - linux-tag - dirk riehle - developer career - web2010 06-10 - linux-tag - dirk riehle - developer career - web
2010 06-10 - linux-tag - dirk riehle - developer career - web
 
Open Source: A New Developer Career
Open Source: A New Developer CareerOpen Source: A New Developer Career
Open Source: A New Developer Career
 
Micro-Blogging in the Enterprise Focus Groups Evaluation
Micro-Blogging in the Enterprise Focus Groups EvaluationMicro-Blogging in the Enterprise Focus Groups Evaluation
Micro-Blogging in the Enterprise Focus Groups Evaluation
 
Six Easy Pieces of Quantitatively Analyzing Open Source
Six Easy Pieces of Quantitatively Analyzing Open SourceSix Easy Pieces of Quantitatively Analyzing Open Source
Six Easy Pieces of Quantitatively Analyzing Open Source
 
Learning From Wikipedia
Learning From WikipediaLearning From Wikipedia
Learning From Wikipedia
 
Open Collaboration
Open CollaborationOpen Collaboration
Open Collaboration
 

Recently uploaded

Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

The Comment Density of Open Source Software Code

  • 1. The Comment Density of Open Source Software Code Oliver Arafat Dirk Riehle Siemens AG, Corporate Technology SAP Research, SAP Labs LLC oarafat@gmail.com dirk@riehle.org, http://dirkriehle.com http://twitter.com/dirkriehle  Oliver Arafat, Dirk Riehle. “The Comment Density of Open Source Software  Code.” In Companion to Proceedings of the 31st International Conference  on Software Engineering (ICSE 2009). IEEE Press, 2009. Page 195­198. [Link: http://dirkriehle.com/2009/02/04/the­comment­density­of­open­source­software­code/]
  • 2. Overview Summary (blog post): The Sweet Spot of Code Commenting in Open Source [Link: http://dirkriehle.com/2009/02/04/the­sweet­spot­of­code­commenting­in­open­source/] Abstract: The development processes of open source software are different from  traditional closed source development processes. Still, open source software is  frequently of high quality. Thus, we are investigating how open source software  creates high quality and whether it can maintain this quality for ever larger project  sizes. In this paper, we look at one particular quality indicator, the density of comments  in open source software code. In a large­scale study of more than 5,000 projects, we  find that active open source projects document their source code, and we find that the  comment density is independent of team and project size, but not of project age. In  future work, we intend to correlate comment density with project success or failure.  Reference: Oliver Arafat, Dirk Riehle.  “The Comment Density of Open Source Software Code.” In Companion to  Proceedings of the 31st International Conference on Software Engineering (ICSE  2009). IEEE Press, 2009. Page 195­198. [Link:  http://dirkriehle.com/2009/02/04/the­comment­density­of­open­source­software­code/] April 6, 2009. Version 1.1.  © 2009 Dirk Riehle. All Rights Reserved. 2
  • 3. Analysis Process and Tool Chain Raw data source Raw Data • Local database  (ohloh.net snapshot , crawled sources ) (ohloh.net) • Web services access  (ohloh.net, sourceforge .net, others) Pre­processing SQL, Java • Database querying using SQL and scripts • Java library for computationally heavyweight filters , aggregation Aggregated data source Aggr. Data • Output of pre ­processing stage for specific analytical tasks (RDBMS) • Aggregated data significantly improves analysis speed Excel / Calc, Analytical processing R, spec. tools • Mines aggregated  (and raw ) data for insights , hypothesis testing • At present basic processing  (Excel), machine learning next Analysis output • Results of analytical processing : averages , distributions , correlations 3 4 5 • Presented as models , tables, graphs, charts, etc. 3 2 1 April 6, 2009. Version 1.1.  © 2009 Dirk Riehle. All Rights Reserved.
  • 4. Commenting Practice in Open Source 1.E+08 y = 0.1596x 1.E+07 R2 = 0.872 1.E+06 Comment Lines 1.E+05 1.E+04 1.E+03 1.E+02 1.E+01 1.E+00 1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08 1.E+09 Project Size in Lines of Code (LoC = CL + SLoC) April 6, 2009. Version 1.1.  © 2009 Dirk Riehle. All Rights Reserved. 4
  • 5. Comment Density 100% mean = 0.1867 90% median = 0.1674 stdev = 0.1088 80% correl = ­0.00787 70% Comment Density 60% 50% 40% 30% 20% 10% 0% 1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08 1.E+09 Project Size in Lines of Code (LoC = CL + SLoC) • Comment density • Definitions – Percentage of lines of code that are comments – CL = comment lines – Measure of how well documented code is – SLoC = source lines of code – CD = CL / (CL + SLoC) – Indicator of likelihood of survival April 6, 2009. Version 1.1.  © 2009 Dirk Riehle. All Rights Reserved. 5
  • 6. Variation by Programming Language Java 1.E+08 100% 70% y = 0.2798x 90% mean = 0.2587 1.E+07 median = 0.2566 60% R2 = 0.9334 Percentage of Occurrences 80% stdev = 0.1111 1.E+06 70% 50% Comment Density Comment Lines 1.E+05 60% 40% 0.3327 1.E+04 50% 30% 40% 0.2450 0.2340 1.E+03 30% 20% 1.E+02 20% 0.0841 0.0841 10% 1.E+01 10% 0.0165 0.0037 0.0000 0.0000 1.E+00 0% 0% 1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08 1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08 [0.0, 0.1[ [0.1, 0.2[ [0.2, 0.3[ [0.3, 0.4[ [0.4, 0.5[ [0.5, 0.6[ [0.6, 0.7[ [0.7, 0.8[ [0.8, 0.9[ y = 0.2798x; R2 = 0.9334; mean = 0.2587; median = 0.2566; stdev = 0.1111; population size = 1085 Project Size in Lines of Code (CL + SLoC) Project Size in Lines of Code (CL + SLoC) Comment Density PERL 1.E+07 100% 70% y = 0.1602x 90% mean = 0.1044 0.5855 1.E+06 median = 0.0902 60% R2 = 0.9464 Percentage of Occurrences 80% stdev = 0.0713 1.E+05 70% 50% Comment Density Comment Lines 60% 1.E+04 40% 0.3382 50% 1.E+03 30% 40% 1.E+02 30% 20% 20% 1.E+01 y = 0.1602x; R2 = 0.9464; mean = 0.1044; median = 0.0902; stdev = 0.0713; population size = 273 10% 10% 0.0582 0.0073 0.0109 0.0000 0.0000 0.0000 0.0000 1.E+00 0% 0% 1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08 1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08 [0.0, 0.1[ [0.1, 0.2[ [0.2, 0.3[ [0.3, 0.4[ [0.4, 0.5[ [0.5, 0.6[ [0.6, 0.7[ [0.7, 0.8[ [0.8, 0.9[ Project Size in Lines of Code (CL + SLoC) Project Size in Lines of Code (CL + SLoC) Comment Density More here: How Open Source Comments (by Programming Language) [Link: http://dirkriehle.com/2008/11/10/how­open­source­comments­by­programming­language/] April 6, 2009. Version 1.1.  © 2009 Dirk Riehle. All Rights Reserved. 6
  • 7. Comment Density by Commit Size 100% sloc size = 1­100 sloc size = 50­100 sloc size = 80­100 Comment Density by SLoC Size 90% mean = 0.2513 mean = 0.2234 mean = 0.2224 Total Average Comment Density median = 0.2340 median = 0.2190 median = 0.2171 80% stdev = 0.0626 stdev = 0.0169 stdev = 0.0209 70% Comment Density 60% 50% 40% 30% 20% 10% 0% 0 10 20 30 40 50 60 70 80 90 100 Source Lines of Code in Commit [SLoC] April 6, 2009. Version 1.1.  © 2009 Dirk Riehle. All Rights Reserved. 7
  • 8. Comment Density by Team Size 50% 50% team size = 1­20 team size = 1­50 team size = 1­100 Comment Density by Team Size 45% mean = 0.1914 mean = 0.1922 mean = 0.1856 45% Total Average Comment Density median = 0.1878 median = 0.1906 median = 0.1857 40% Percent Projects with Team Size 40% stdev = 0.0255 stdev = 0.0425 stdev = 0.0641 correl = ­0.0550 35% 35% Comment Density 30% 30% 25% 25% 20% 20% 15% 15% 10% 10% 5% 5% 0% 0% 0 10 20 30 40 50 60 70 80 90 100 Team Size [Number of Committers] April 6, 2009. Version 1.1.  © 2009 Dirk Riehle. All Rights Reserved. 8
  • 9. Comment Density by Project Age 19.50% correl = ­0.9054 19.25% Average Comment Density 19.00% 18.75% 18.50% 18.25% 18.00% 17.75% 17.50% 0 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 Project Age [months] April 6, 2009. Version 1.1.  © 2009 Dirk Riehle. All Rights Reserved. 9
  • 10. Commenting Practice Summary • Successful open source projects follow a consistent commenting practice – 1 out of 20 commits serves commenting purposes only – 1 out of 5 non­empty lines is a comment line • Average comment density is independent of project size • Average comment density is independent of team size • Average comment density slowly falls with project age April 6, 2009. Version 1.1.  © 2009 Dirk Riehle. All Rights Reserved. 10
  • 11. References • Dirk Riehle, John Ellenberger, Tamir Menahem, Boris Mikhailovski, Yuri Natchetoi, Barak Naveh, Thomas  Odenwald. “Bringing Open Source Best Practices into Corporations Using a Software Forge.” IEEE  Software, 2009. See:  http://dirkriehle.com/2009/02/11/open­collaboration­within­corporations­using­software­forges/  • Dirk Riehle. “The Economic Motivation of Open Source: Stakeholder Perspectives.” IEEE Computer, vol.  40, no. 4 (April 2007). Page 25­32. See:  http://dirkriehle.com/computer­science/research/2007/computer­2007.html  • Oliver Arafat, Dirk Riehle. “The Commit Size Distribution of Open Source Software.” In Proceedings of  the 42nd Hawaiian International Conference on System Sciences (HICSS 42). IEEE Press, 2009. See:  http://dirkriehle.com/2008/09/23/the­commit­size­distribution­of­open­source­software/  • Amit Deshpande, Dirk Riehle. “The Total Growth of Open Source.” In Proceedings of the Fourth  Conference on Open Source Systems (OSS 2008). Springer Verlag, 2008. Page 197­209. See:  http://dirkriehle.com/2008/03/14/the­total­growth­of­open­source/  • Amit Deshpande, Dirk Riehle. “Continuous Integration in Open Source Software Development.” In  Proceedings of the Fourth Conference on Open Source Systems (OSS 2008). Springer Verlag, 2008.  Page 273­280. See:  http://dirkriehle.com/2008/03/08/continuous­integration­in­open­source­software­development/  • Philipp Hofmann, Dirk Riehle. “Estimating Commit Sizes Efficiently.” In Proceedings of the 5th  International Conference on Open Source Systems (OSS 2009). Springer Verlag, 2009. Page 105­115.  See: http://dirkriehle.com/2009/02/11/estimating­commit­sizes­efficiently/  • Oliver Arafat, Dirk Riehle. “The Comment Density of Open Source Software Code.” In Companion to  Proceedings of the 31st International Conference on Software Engineering (ICSE 2009). IEEE Press,  2009. Page 195­198. See:  http://dirkriehle.com/2009/02/04/the­comment­density­of­open­source­software­code/  11 April 6, 2009. Version 1.1.  © 2009 Dirk Riehle. All Rights Reserved.
  • 12. Thank you! Questions? For any feedback or questions, please email the authors: Oliver Arafat Dirk Riehle Siemens AG, Corporate Technology SAP Research, SAP Labs LLC oarafat@gmail.com dirk@riehle.org, http://dirkriehle.com http://twitter.com/dirkriehle