Is the Pareto Principle Applicable to the Core Teams of GitHub Projects?
Kazuhiro Yamashita, Yasutaka Kamei, Shane McIntosh, Naoyasu Ubayashi, Ahmed E. Hassan
Core developers play a critical role in software development
Core developers are responsible for guiding and coordinating the development of an OSS project. (Nakakoji)
Core developers are the most productive developers, who have made roughly 80% of the total contributions. (Mockus)
In fact, some argue that core developers in OSS projects follow the Pareto Principle
[Figure: the Pareto Principle. 20% of the effort produces 80% of the result.]
Pareto Principle in Software Development
[Figure: 20% of a project's developers produce 80% of its artifacts.]
Prior studies have arrived at mixed conclusions about core teams and the Pareto Principle
[Figure: prior studies grouped by conclusion (Pareto, Non-Pareto, Other): Goeminne (IWSQM), Robles (RAMSS), Mockus (TOSEM), Geldenhuys (ECSEAA), Koch (ISJ), Dinh-Trong (TSE)]
These results depend on a small number of case study systems.
[Figure: the same studies grouped by number of core developers (< 10 or 15, Other)]
Overview of our study of core teams on GitHub
Applicability of the Pareto Principle
Number of Core Developers
Core and Non-Core Developer Activities
Collecting and analyzing GitHub data to study core team activity
[Figure: study pipeline. Projects -> Filter -> Heuristics (Core / Non-Core) -> Calc Prop -> Core Team Size; Core / Non-Core -> Classify Commits -> Activity]
Preprocessing GitHub data to handle forks and duplicates, and to remove immature projects
8,510,504 repositories -> 2,496 repositories
Using heuristics to identify core team members
Three heuristics: Commit-based, LOC-based, Access-based
Our commit-based core contributor heuristic
Step 1: Sort contributors by their number of commits
[Figure: contributors A, B, C, D ordered by number of commits]
Step 2: Compute the proportion of commits that each contributor made
Commit ratio: A 60%, B 20%, C 10%, D 10%
Step 3: Core contributors are those developers below the 0.8 cumulative contribution cutoff
Cumulative ratio: A 0.6, B 0.8, C 0.9, D 1.0
Num CoreDev: 2
Pct. CoreDev: 2/4*100 = 50%
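The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `core_contributors` is hypothetical, ties in commit count are broken by name (the slides do not say how ties are resolved), and a contributor counts as core as long as the cumulative share before them is still below the cutoff, which reproduces the A/B/C/D example.

```python
from fractions import Fraction


def core_contributors(commits, cutoff=Fraction(4, 5)):
    """Return the core contributors for a mapping of name -> commit count.

    Assumption: contributors are added in ranked order until the
    cumulative share of commits reaches the cutoff (0.8 by default).
    """
    total = sum(commits.values())
    # Step 1: sort by number of commits, descending (name breaks ties).
    ranked = sorted(commits.items(), key=lambda item: (-item[1], item[0]))
    core = []
    cumulative = 0
    for name, count in ranked:
        # Steps 2 and 3: compare the cumulative share with the cutoff.
        # Exact fractions avoid floating-point surprises at 0.8.
        if cumulative >= cutoff * total:
            break
        core.append(name)
        cumulative += count
    return core


# Example from the slides: A 60%, B 20%, C 10%, D 10%.
example = {"A": 6, "B": 2, "C": 1, "D": 1}
core = core_contributors(example)
pct_core = 100 * len(core) / len(example)
print(core, pct_core)
```

For the example project this yields two core developers out of four, matching the Num CoreDev and Pct. CoreDev values on the slide.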
Our approach to study Core Team Size
Stratify projects along the confounding factors
LOC: Small, Medium, Large
Total Author: Small, Medium, Large
Age: Small, Medium, Large
Compliance with the Pareto Principle
[Figure: axis "Percentage of Core Devs" with marks at 10%, 20%, 30%]
The example project does not follow the Pareto Principle
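A compliance check of this kind might look like the sketch below. The slides do not state the exact compliance rule, so the 10% to 30% band is an assumption read off the axis marks, and both function names are hypothetical.

```python
def percent_core(num_core, num_developers):
    """Percentage of a project's developers who are core developers."""
    return 100.0 * num_core / num_developers


def follows_pareto(pct_core, low=10.0, high=30.0):
    """Assumed rule: compliant when the core share lies in [low, high].

    The band around 20% is an illustrative assumption, not the
    study's documented threshold.
    """
    return low <= pct_core <= high


# Example project from the heuristic slides: 2 core developers out of 4.
pct = percent_core(2, 4)
print(pct, follows_pareto(pct))
```

Under this assumed band, the example project (50% core developers) is classified as not following the Pareto Principle, in line with the slide.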
Core team proportions are widely spread
[Figure: distribution of core team proportions, commit-based heuristic, projects divided by LOC]
Often, there are 15 or fewer core developers in a project
Projects with 15 or fewer core developers: Commit-Based 88%, LOC-Based 98%, Access-Based 96%
Overview of our study of core teams on GitHub
Applicability of the Pareto Principle: more than half of the projects do not follow the Pareto Principle
Number of Core Developers: most projects have 15 or fewer core developers
Core and Non-Core Developer Activities
Our approach to study activity
We classify commits by using keywords.
Activity Type: Keywords
Development
  Forward Engineering: implement, add, request
Maintenance
  Reengineering: optimiz, adjust
  Corrective Engineering: bug, fix, issue, error
  Management: license, formatting, TODO
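The keyword table lends itself to a small classifier. The sketch below is an illustration, not the study's actual tooling: it assumes case-insensitive substring matching on the commit message (so the stem "optimiz" matches "optimize" and "optimization") and returns every matching activity type, since the slides do not say how multiple matches are resolved. The name `classify_commit` is hypothetical.

```python
# Keywords taken from the slide; matching rules are assumptions.
KEYWORDS = {
    "Forward Engineering": ("implement", "add", "request"),
    "Reengineering": ("optimiz", "adjust"),
    "Corrective Engineering": ("bug", "fix", "issue", "error"),
    "Management": ("license", "formatting", "todo"),
}


def classify_commit(message):
    """Return the sorted list of activity types whose keywords occur."""
    text = message.lower()
    return sorted(
        activity
        for activity, keywords in KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    )


print(classify_commit("Fix crash in parser"))
print(classify_commit("Add TODO for license header"))
print(classify_commit("Update README"))
```

Plain substring matching is deliberately naive: a message containing "address" would match the keyword "add", so a real classifier would probably need word boundaries or stemming.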
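No big differences in proportions of development activities
[Figure: proportions of development activities for core and non-core developers under the Commit-Based, LOC-Based, and Access-Based heuristics]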
Overview of our study of core teams on GitHub
Applicability of the Pareto Principle: more than half of the projects do not follow the Pareto Principle
Number of Core Developers: most projects have 15 or fewer core developers
Core and Non-Core Developer Activities: there are no big differences between core and non-core activities
Extremely large core teams may be interesting
Number of projects by core team size
Heuristic       -15    16-20  21-50  51-100  101-
Commit-Based    2,197  98     137    17      47
LOC-Based       2,454  15     13     4       10
Access-Based    1,164  24     24     0       0
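As a quick arithmetic check, the share of projects in the "-15" column can be recomputed from the table. The sketch below uses only the counts shown on the slide; the variable and function names are illustrative. Note that the Commit-Based and LOC-Based rows each sum to 2,496 projects, while the Access-Based row sums to 1,212.

```python
# Counts per core team size bucket: -15, 16-20, 21-50, 51-100, 101-
SIZE_TABLE = {
    "Commit-Based": [2197, 98, 137, 17, 47],
    "LOC-Based": [2454, 15, 13, 4, 10],
    "Access-Based": [1164, 24, 24, 0, 0],
}


def share_at_most_15(counts):
    """Percentage of projects in the first bucket (15 or fewer)."""
    return 100.0 * counts[0] / sum(counts)


shares = {name: round(share_at_most_15(row)) for name, row in SIZE_TABLE.items()}
totals = {name: sum(row) for name, row in SIZE_TABLE.items()}
print(shares)
print(totals)
```

The rounded shares come out as 88%, 98%, and 96%, which agrees with the percentages reported for the three heuristics when each is computed over its own row total.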
Many projects face a bus factor risk
Commit-Based: 43% (Core=1: 8%)
LOC-Based: 81% (Core=1: 24%)
Access-Based: 54% (Core=1: 21%)
In fact, most projects have fewer than 5 core developers
Conclusion
Core Developer
Additional slides
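Additional description of our definition
[Figure: cumulative ratio for contributors A, B, C, D, E with 0.8 and 1.0 marked; annotation: "Depend on Name"]
[Figure: Commit-based; panels: Age, Total Author]
[Figure: LOC-based; panels: Age, Total Author, LOC]
[Figure: Access-based; panels: Age, Total Author, LOC]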
Data Extraction
(1) Filter projects by GHTorrent
Filter out forked repositories.
Filter out repositories with fewer than 10 developers.
Filter out repositories that are developed outside of GitHub.
8,510,504 repositories -> 4,618 repositories
Fork: one of the features of GitHub
[Figure: an Original Repository is forked (cloned) into a Fork Repository, which sends a Pull Request back to the original]
(2) Clone repositories to a local server
4,618 repositories -> 4,154 repositories
(3) Filter duplicate projects
[Figure: Project A has a fork; Project B is registered separately but is a clone of A]
Compare the SHAs of Project A and Project B
Example SHAs shown: c87cce1, e1a7260, f40ccb5, 455e44c, 8b67f28, 651fa5e, 655b8be, 757dd93, a4cf371, 8145880, cf484e3, 4e63bde
4,618 repositories -> 3,533 repositories
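The SHA comparison for duplicate detection could be sketched as follows. This is a minimal illustration under assumptions: the slides only say "Compare SHAs", so treating two repositories as duplicates when they share at least one commit SHA is an assumed rule, the function name `find_duplicates` is hypothetical, and the SHAs in the example are placeholders rather than data from the study.

```python
def find_duplicates(repo_shas):
    """Return pairs of repository names that share at least one commit SHA.

    repo_shas maps a repository name to the set of its commit SHAs.
    Assumption: any shared SHA marks the pair as duplicates.
    """
    names = sorted(repo_shas)
    pairs = []
    for index, first in enumerate(names):
        for second in names[index + 1:]:
            # Set intersection is non-empty when histories overlap.
            if repo_shas[first] & repo_shas[second]:
                pairs.append((first, second))
    return pairs


# Placeholder SHAs for illustration only.
repos = {
    "project-a": {"sha1", "sha2", "sha3"},
    "project-b": {"sha2", "sha3", "sha4"},
    "project-c": {"sha9"},
}
duplicates = find_duplicates(repos)
print(duplicates)
```

The pairwise loop is quadratic in the number of repositories; for thousands of repositories, indexing repositories by SHA would be a more practical design.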
(4) Calculate metrics for each repository
LOC, Total Commits, Total Authors, Age
(5) Filter projects by metrics
Filter out repositories with fewer than 10 developers.
Filter out repositories with fewer than 1,000 LOC.
4,618 repositories -> 2,496 repositories
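The metric-based filter in step (5) can be illustrated with a short sketch. The record type `RepoMetrics`, its field names, and the use of days for age are assumptions made for the example, as is equating "devs" with the Total Authors metric; only the two thresholds (10 developers, 1,000 LOC) come from the slides.

```python
from dataclasses import dataclass


@dataclass
class RepoMetrics:
    """Hypothetical container for the four metrics from step (4)."""
    name: str
    loc: int
    total_commits: int
    total_authors: int
    age_days: int  # unit is an assumption; the slides do not give one


def keep_repository(metrics, min_authors=10, min_loc=1000):
    """Keep repositories with at least 10 developers and 1,000 LOC."""
    return metrics.total_authors >= min_authors and metrics.loc >= min_loc


# Made-up repositories for illustration only.
candidates = [
    RepoMetrics("large-team", loc=25000, total_commits=4200, total_authors=35, age_days=900),
    RepoMetrics("tiny-code", loc=400, total_commits=120, total_authors=12, age_days=300),
    RepoMetrics("small-team", loc=8000, total_commits=950, total_authors=4, age_days=700),
]
kept = [repo.name for repo in candidates if keep_repository(repo)]
print(kept)
```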

Revisiting the Applicability of the Pareto Principle to Core Development Teams in Open Source Software Projects

  • 1. Is the Pareto Principle Applicable to the Core Teams of GitHub Projects? Kazuhiro Yamashita Yasutaka Kamei Shane McIntosh Naoyasu Ubayashi Ahmed E. Hassan
  • 2. Core developers play a critical role in software development 2 Core developers are responsible for guiding and coordinating the development of an OSS project. The most productive developers who have made roughly 80% of the total contributions. Nakakoji Mockus
  • 3. In fact, some argue that core developers in OSS projects follow the Pareto Principle 5 Effort Result 80% 80% 20%20%
  • 4. Pareto Principle in Software Development 6 20 % 80 % 20 % 80 % Project Developers Artifacts
  • 5. Prior studies have arrived at mixed conclusions about core teams and the Pareto Principle 7 Pareto Non-Pareto Goeminne IWSQM Robles RAMSS Mockus TOSEM Geldenhuys ECSEAA Koch ISJ Dinh-Trong TSE The results depend on small number of case study systems Other
  • 6. Prior studies have arrived at mixed conclusions about core teams and the Pareto Principle 8 < 10 or 15 Other Goeminne IWSQM Robles RAMSS Mockus TOSEM Geldenhuys ECSEAA Koch ISJ Dinh-Trong TSE
  • 7. Overview of our study of core teams on GitHub 19 Applicability of the Pareto Principle Number of Core Developers
  • 8. Overview of our study of core teams on GitHub 20 Core and Non-Core Developers Activities Applicability of the Pareto Principle Number of Core Developers
  • 9. Collecting and analyzing GitHub data to study core team activity 21 Filter Heuristics Core Non-Core Core Non-Core Calc Prop Projects Core Non-Core Classify Commits Core Team Size Activity
  • 10. Collecting and analyzing GitHub data to study core team activity 22 Filter Heuristics Core Non-Core Projects 22 Core Non-Core Calc Prop Core Non-Core Classify Commits Core Team Size Activity
  • 11. Preprocessing GitHub data to handle forks, duplicates, and to remove immature projects 23 8,510,504 repositories -> 2,496 repositories
  • 12. Collecting and analyzing GitHub data to study core team activity 24 Filter Heuristics Core Non-Core Projects 24 Core Non-Core Calc Prop Core Non-Core Classify Commits Core Team Size Activity
  • 13. Using heuristics to identify core team members 26 Commit-based LOC-based Access-based Core Core Core
  • 14. 29 A B C D Our commit-based core contributor heuristic Number of Commits = Commit
  • 15. Step1: Sort contributors by their number of commits 30 A BC D Number of Commits
  • 16. Step2: Compute the proportion of commits that each contributor 32 A BC D 60% 20% 10% 10% Commits ratio
  • 17. Step3: Core contributors are those developers below the 0.8 cumulative contribution cutoff 33 A BC D 0.8 1.0 0.6 Cumulative ratio Pct. CoreDev 2/4*100 = 50% Num CoreDev 2
  • 18. Collecting and analyzing GitHub data to study core team activity 35 Filter Heuristics Core Non-Core Projects 35 Core Non-Core Calc Prop Core Non-Core Classify Commits Core Team Size Activity
  • 19. Overview of our study of core teams on GitHub 36 Core and Non-Core Developers Activities Applicability of the Pareto Principle Number of Core Developers
  • 20. Overview of our study of core teams on GitHub 37 Core and Non-Core Developers Activities Applicability of the Pareto Principle Number of Core Developers
  • 21. Collecting and analyzing GitHub data to study core team activity 38 Filter Heuristics Core Non-Core Projects 38 Core Non-Core Calc Prop Core Non-Core Classify Commits Core Team Size Activity
  • 22. Our approach to studying core team size. Compliance with the Pareto Principle: the percentage of core developers lies between 10% and 30% (the example project does not follow the Pareto Principle). Stratify projects along the confounding factors: LOC, total authors, and age, each split into small, medium, and large
  • 23. Core team proportions are widespread. [Figure: commit-based heuristic, projects divided by LOC]
  • 24. Often, there are 15 or fewer core developers in a project. Share of projects: Commit-Based 88%, LOC-Based 98%, Access-Based 96%
  • 25. Overview of our study of core teams on GitHub: applicability of the Pareto Principle; number of core developers; core and non-core developers' activities. Findings so far: more than half of the projects do not follow the Pareto principle; most projects have 15 or fewer core developers
  • 28. Our approach to studying activity: we classify the commits by using keywords. Activity types and example keywords: Forward Engineering (development): implement, add, request; Reengineering (maintenance): optimiz, adjust; Corrective Engineering: bug, fix, issue, error; Management: license, formatting, TODO
  • 29. No big differences in the proportions of development activities. [Figure: panels for the commit-based, LOC-based, and access-based heuristics]
  • 30. Overview of our study of core teams on GitHub: applicability of the Pareto Principle; number of core developers; core and non-core developers' activities. Findings: more than half of the projects do not follow the Pareto principle; most projects have 15 or fewer core developers; there are no big differences between core and non-core activities
  • 32. Extremely large core teams may be interesting. Number of projects by number of core developers:
        Heuristic      ≤15     16-20   21-50   51-100  ≥101
        Commit-Based   2,197   98      137     17      47
        LOC-Based      2,454   15      13      4       10
        Access-Based   1,164   24      24      0       0
  • 33. Many projects face a bus factor risk. In fact, many projects have fewer than 5 core developers: Commit-Based 43% (Core=1: 8%), LOC-Based 81% (Core=1: 24%), Access-Based 54% (Core=1: 21%)
  • 37. Additional description of our definition. [Figure: contributors A, B, C, D, and E against the 0.8 and 1.0 cumulative ratios; label: "Depend on Name"]
  • 44. Fork: one of the features of GitHub. [Diagram: an original repository is forked (cloned) into a fork repository, which is linked back to the original by pull requests]
  • 47. Data Extraction (1): Filter projects using GHTorrent. Filter out forked repositories. Filter out repositories with fewer than 10 developers. Filter out repositories that are developed outside of GitHub. 8,510,504 repositories -> 4,618 repositories
  • 49. Data Extraction (2): Clone the repositories to a local server. 4,618 repositories -> 4,154 repositories
  • 51. Data Extraction (3): Filter duplicate projects. [Diagram: Project A can be forked (a fork of A), or cloned and registered as a separate Project B (a clone of A)]
  • 52. Data Extraction (3): Filter duplicate projects by comparing commit SHAs. 4,618 repositories -> 3,533 repositories. [Figure: lists of commit SHAs for Project A and Project B]
  • 54. Data Extraction (4): Calculate metrics for each repository: LOC, total commits, total authors, and age
  • 56. Data Extraction (5): Filter projects by metrics. Filter out repositories with fewer than 10 developers. Filter out repositories with fewer than 1,000 LOC. 4,618 repositories -> 2,496 repositories
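
Steps 1 to 3 of the commit-based heuristic, together with the 10% to 30% compliance range, can be sketched in Python roughly as follows. This is a minimal illustration and not the authors' implementation. The function names, the name-based tie-breaking, and the choice to count a developer as core when the cumulative ratio is at most 0.8 are assumptions; the commit counts for B, C, and D are inferred from the 60% / 20% / 10% / 10% ratios in the example.

```python
from fractions import Fraction


def core_developers(commits_by_dev, cutoff=Fraction(4, 5)):
    """Return the developers whose cumulative commit ratio stays within the cutoff.

    Assumption: a developer is core while the running ratio is <= cutoff,
    and ties in commit count are broken alphabetically by name.
    """
    total = sum(commits_by_dev.values())
    if total == 0:
        return []
    # Step 1: sort by number of commits, descending (name as tie-breaker).
    ranked = sorted(commits_by_dev.items(), key=lambda kv: (-kv[1], kv[0]))
    core = []
    running = 0
    for name, commits in ranked:
        # Steps 2 and 3: accumulate each developer's share of the commits.
        running += commits
        if Fraction(running, total) > cutoff:
            break
        core.append(name)
    return core


def follows_pareto(num_core, num_devs, low=10, high=30):
    """Check whether the percentage of core developers lies in [low, high]."""
    pct = Fraction(num_core * 100, num_devs)
    return low <= pct <= high


# Example project from the slides: four developers, ten commits in total.
example = {"A": 6, "B": 1, "C": 2, "D": 1}
core = core_developers(example)
pct_core = 100 * len(core) / len(example)
print(core, pct_core, follows_pareto(len(core), len(example)))
```

Exact fractions are used instead of floating-point numbers so that a cumulative ratio of exactly 0.8 is compared reliably against the cutoff. The LOC-based heuristic could reuse the same function with changed lines of code in place of commit counts.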

Editor's notes

  1. I’m Kazuhiro Yamashita, a PhD student at Kyushu University, Japan. Today, I would like to talk about my research. The title of this talk is “Is the Pareto Principle Applicable to the Core Teams of GitHub Projects?” This is joint work between Kyushu University and Queen’s University.
  2. In this study, we focus on core developers and the Pareto principle. Core developers are developers who play important roles in software development projects. For example, Nakakoji et al. state that core developers are responsible for guiding and coordinating the development of an OSS project. Mockus et al., on the other hand, define core developers as the most productive developers, who have made roughly 80% of the total contributions. The two definitions differ slightly, but both say that core developers are important. Core developers are therefore key to the success of OSS projects, and a number of papers focus on them.
  3. This is the agenda of this talk. First, we look at the definitions of core developers and the Pareto principle. Next, we review the results of prior studies. Then, we present the research questions that we derived from those results. After that, we describe our case study. Finally, we conclude.
  4. Therefore, there are papers that focus on core developers, and some of them claim that the size of the core team in a successful project follows the Pareto principle.
  5. Some of the papers argue that the proportions of core developers in OSS projects follow the Pareto principle. The Pareto principle, also known as the 80-20 rule, states that roughly 80% of the results come from 20% of the causes, as this figure shows. The principle originated in economics, but it has since been applied to various fields, including software engineering.
  6. In the context of software development, such papers claim that 20% of the developers produce 80% of the artifacts.
  7. As we described, there are papers that claim that the size of the core team in a successful project follows the Pareto principle. On the other hand, there are papers that claim that it does not. In other words, prior studies have arrived at mixed conclusions about core teams and the Pareto principle. We assume that these mixed conclusions arise because the results depend on a small number of case study systems. In fact, the prior studies used at most 9 OSS projects.
  8. In addition to the Pareto principle, prior studies have also arrived at mixed conclusions about the number of core developers. Mockus et al. claim that the number of core developers is less than 10 or 15, but other papers report different numbers. For instance, Dinh-Trong et al. showed that 27 to 42 developers account for more than 80% of the contributions in the FreeBSD project.
  11. On the other hand, there is a paper that claims that the proportion of core developers does not follow the Pareto principle. In addition to the Pareto principle, some papers report the exact number of core developers, but the numbers differ from paper to paper. When we consider why such discrepancies arise, we find that all of the results depend on a small number of case study systems.
  12. From the previous work, we derive research question 1 and its motivation. In RQ1, we would like to generalize the previous results. In other words, we would like to know whether the proportion of core developers follows the Pareto principle. Additionally, we would like to know the typical number of core developers.
  13. Therefore, we formulate the research question.
  14. In addition to the size of the core team, Mockus et al. claim that a group that is larger by an order of magnitude than the core team will repair defects. From this statement, we assume that non-core developers work more on bug fixing than on implementing new functions. Therefore, we formulate research question 2 according to this assumption.
  15. The motivation of RQ2 is that we would like to know the proportions of the activities of core and non-core developers. By clarifying these proportions, we would like to confirm our assumption.
  16. The second research question is that …
  17. From these points, we derived the first part of our study. In this part, we focus on core team size and study the applicability of the Pareto principle to core developers using GitHub projects. Prior studies discuss not only the proportions but also the numbers of core developers. Therefore, we also study the number of core developers in this part.
  18. In the second part of our study, we focus on the activities of core and non-core developers. This part is also derived from a prior study, in which Mockus et al. state that a group that is larger by an order of magnitude than the core team will repair defects. From this statement, we assume that non-core developers work more on fixing bugs than on implementing new functionality. Hence, we study the activities of core and non-core developers in the second part. This is an overview of our study.
  19. Now we show the steps for collecting and analyzing GitHub data to study core team activity. As the common part of both studies, we perform two steps to collect data and identify core developers. After these two steps, we perform both studies. In the study of core team size, we calculate the proportion and number of core developers in each project, and then we determine whether the proportions follow the Pareto principle. In the study of activity, we extract the commits of both types of developers, and then we classify the commits and compare their activities.
  20. We explain each step of our study. First, we show how to filter projects.
  21. In this study, we used GitHub projects as our dataset. Initially, the dataset includes 8.5 million repositories. However, it also includes forked repositories, duplicates, and immature projects. To remove such repositories, we preprocess the dataset. After the preprocessing, 2,496 repositories remain. We conduct our case study on these 2,496 repositories.
  22. Next, we show the heuristics that we use to identify core developers.
  23. In this study, we used three heuristics to identify core developers. In the commit-based heuristic, we identify core developers using the number of commits made by each developer. In the LOC-based heuristic, we identify core developers using the amount of LOC changed by each developer. In the access-based heuristic, we identify core developers using access rights: a developer is core if he or she has access rights to the repository. For the commit-based and LOC-based heuristics, however, we need a way to separate core from non-core developers.
  24. We show the steps to identify core developers with the commit-based heuristic using this example project. In this project, there are 4 developers, and each of them has made some commits.
  25. As the first step, we sort the developers by their number of commits in descending order.
  26. After sorting, we calculate the proportion of commits made by each developer. For example, developer A made 6 of the 10 commits. Hence, the proportion for developer A is 60%.
  27. Finally, we calculate the cumulative proportion and identify the developers who are below the 0.8 cumulative cutoff as core developers. In this example, developers A and C are core developers, and B and D are non-core developers. The percentage of core developers, in this case, is 50%, and the number of core developers is 2. The LOC-based heuristic follows the same steps as the commit-based heuristic, but it uses LOC instead of the number of commits.
  28. We identified core and non-core developers in each project. Now we show the answers to our questions.
  29. These are our two questions.
  30. First, we show the results about core team size. The questions that we address are: Is the Pareto principle applicable? And what is the typical number of core developers?
  31. This is the corresponding part of the figure.
  32. This slide shows our concrete approach to studying core team size. To check the applicability of the Pareto principle, we need to define thresholds. In this study, we define the range from 10% to 30% as the thresholds. Therefore, the example project that we used to explain the steps of our heuristic does not follow the Pareto principle, because 50% of its developers are core developers. In addition to checking the applicability, we stratify projects along the confounding factors to find trends. We do so because we assume that three factors, LOC, total authors, and project age, may affect the size of the core team. For example, a project with a small number of total authors tends to have a higher proportion of core developers. Since the results for all heuristics and confounding factors show similar trends, we show only the results of the commit-based heuristic divided by LOC. (Note: make this clear on the slide.)
  34. These figures show the results of the commit-based heuristic, divided by LOC. The x-axis shows the percentage of core developers, and the y-axis shows the number of projects. From left to right, the figures show the distributions for small, medium, and large LOC projects, respectively. In each figure, the dotted lines are the thresholds of the Pareto principle. From the figures, we find that the proportions of core developers are widespread. In fact, more than half of the projects are outside the range of the Pareto principle. Therefore, we conclude that the proportions of core developers do not follow the Pareto principle.
  35. When we check the number of core developers, roughly 90% or more of the projects have 15 or fewer core developers.
  36. From the study of core team size, we obtained these results.
  37. From
  38. Next, we address the second question. In this study, we focus on the activities of core and non-core developers.
  39. This is the corresponding part of the study in this figure. To compare the activities, we need to classify the commits. We first explain the method that we used for this study, and then we show the results.
  40. To identify developer activities, we use the method proposed by Hattori and Lanza. The method classifies commits into four categories using the commit comments. This table shows the four categories and example keywords. The Forward Engineering category covers activities that implement new functionality, and a representative keyword is “implement”. The Reengineering category covers modifications of existing code, and a representative keyword is “optimize”. The Corrective Engineering category covers bug fixing activities, and a representative keyword is “bug”. The Management category covers activities that control the project, and a representative keyword is “TODO”. If no keyword appears in the commit comment, the commit is classified into the Unknown category. If there is no comment at all, the commit is classified into the Empty category.
  41. This figure shows the proportions of the categories for each type of developer. For example, the blue bars show the proportions of the Forward Engineering category, and the yellow bars show Corrective Engineering. We assumed that corrective engineering would make up a larger proportion of non-core developers’ activity. However, from the figure, we find that there are no big differences in the proportions of corrective engineering. Furthermore, the other three activities also have similar proportions.
  42. Therefore, we reached this conclusion from the study.
  43. Finally, we obtained these results from our study. Now we discuss some points that we can obtain from our results.
  44. First, we think extremely large core teams may be interesting. We think it is natural that the proportions of core developers are widespread. However, there are projects in which core developers make up more than 50% of the developers. It may be interesting to find out how to coordinate such a large number of core developers and how this affects project quality. (Note: replace the figure -> use the absolute number of developers instead of percentages.)
  45. First, we think extremely large core teams may be interesting. We think it is natural that the proportions of core developers are widespread. However, there are projects that have more than 50 core developers. It may be interesting to find out how to coordinate such a large number of core developers and how this affects project quality.
  46. Next, we think many projects face a bus factor risk. We showed that many projects have 15 or fewer core developers. In fact, many projects have fewer than 5 core developers. For example, with the LOC-based heuristic, 81% of the projects have fewer than 5 core developers and 24% of the projects have only 1 core developer. From this fact, we assume that many projects face a bus factor risk.
  47. Now we conclude our talk. First, we showed prior studies and the two questions that we derived from them. Then, we showed our case study design to address the two questions. From the case study, we found that core team proportions are widespread and that there are no big differences in the proportions of development activities between core and non-core developers. That’s all. Thank you.
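
The keyword-based commit classification described in note 40 could be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the method as implemented by Hattori and Lanza or by the authors. The keyword lists are only the examples shown on the slide; case-insensitive substring matching and the rule that the first matching category in table order wins are assumptions, since the notes do not say how a commit that matches several categories is handled.

```python
# Example keywords from the slide; the full keyword lists are not given there.
KEYWORDS = {
    "Forward Engineering": ("implement", "add", "request"),
    "Reengineering": ("optimiz", "adjust"),
    "Corrective Engineering": ("bug", "fix", "issue", "error"),
    "Management": ("license", "formatting", "todo"),
}


def classify_commit(message):
    """Classify one commit comment into an activity category.

    Assumptions: matching is a case-insensitive substring search, and the
    first category (in table order) with a matching keyword is returned.
    """
    if message is None or not message.strip():
        return "Empty"  # the commit has no comment at all
    text = message.lower()
    for category, words in KEYWORDS.items():
        if any(word in text for word in words):
            return category
    return "Unknown"  # a comment exists, but no keyword appears in it


for msg in ["Fix null pointer bug", "Implement parser", "Update README", ""]:
    print(repr(msg), "->", classify_commit(msg))
```

Substring matching is simple but coarse: the keyword "add", for example, also matches a word such as "address". A stricter variant could match whole words or word stems instead.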