Detecting bots in distributed software development activity is important to prevent bias in socio-technical empirical studies. In previous work, we proposed a classification model to detect bots in GitHub repositories based on the pull request and issue comments of GitHub accounts. The current study generalises the approach to git contributors based on their commit messages. We train and evaluate the classification model on a large dataset of 6,922 git contributors. The original model based on pull request and issue comments obtained a precision of 0.77 on this dataset, whereas retraining the classification model on git commit messages increased the precision to 0.80. As a proof-of-concept, we implemented this model in BoDeGiC, an open source command-line tool to detect bots in git repositories.
Evaluating Bot Detection Models on Code Commits
1. Evaluating a bot detection model on git commit messages
Mehdi Golzadeh, Alexandre Decan, Tom Mens
University of Mons - Belgium
BENEVOL 2020
THE 19TH BELGIUM-NETHERLANDS-LUXEMBURG SOFTWARE EVOLUTION WORKSHOP
3. Detecting bots in distributed software development activities is very important
identity merging, team productivity, development effort, developer onboarding or abandonment, …
Pull requests are more than twice as likely to get accepted if bots are present in their commenting activity
6. Bot identification
Current methods and tools
Available ground-truth datasets
Download comments from GitHub: 136,529 repositories, 10,874,611 issue and PR comments
[Flowchart: rating process. Steps: Select an account; Rate (by 1st rater); Rate (by 2nd rater); Discussion; Rate (3rd rater). Branch labels: [agreement], [disagreement], [difficult case]. Outcomes: [include account], [exclude account].]
7. Classification of accounts
An approach based on characteristics of comments
Features:
• Number of comments
• Number of empty comments
• Comment patterns
• Inequality between comments in patterns
Classifiers compared with grid-search cross-validation:
• Random forest classifier (best score)
• k-nearest neighbors
• Support vector machine
• Decision tree classifier
• Logistic regression
Ground-truth dataset of 527 bots and 4,473 humans
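The four comment-based features can be illustrated with a minimal sketch. The slide does not say how patterns are formed or how inequality is measured, so this example assumes that a pattern is a group of comments with identical normalised text and that inequality is the Gini coefficient of the pattern sizes. The names `gini` and `comment_features` are hypothetical.

```python
from collections import Counter


def gini(values):
    """Gini coefficient of non-negative counts: 0 means perfectly equal."""
    values = sorted(values)
    n, total = len(values), sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending values with 1-based ranks.
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def comment_features(comments):
    """Compute the four slide features for one account's comments."""
    # Assumption: a "pattern" is a group of comments whose lower-cased,
    # whitespace-normalised text is identical (empty comments form one group).
    normalised = [" ".join(c.lower().split()) for c in comments]
    pattern_sizes = Counter(normalised).values()
    return {
        "n_comments": len(comments),
        "n_empty_comments": sum(1 for c in normalised if not c),
        "n_patterns": len(pattern_sizes),
        # Assumption: inequality is measured as the Gini coefficient.
        "inequality": gini(pattern_sizes),
    }


features = comment_features(["LGTM", "lgtm ", "", "Build passed", "LGTM"])
```

An account that repeats a few messages many times yields few patterns with unequal sizes, which is the kind of signal these features are meant to capture.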
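Grid-search cross-validation, as named on the slide, scores every candidate configuration on held-out folds and keeps the best one. The sketch below shows only the procedure, not the authors' setup. To stay dependency-free it assumes a hand-rolled k-nearest-neighbours classifier, a grid over k, accuracy as the score, and synthetic made-up data. All function names are hypothetical.

```python
import random
from collections import Counter


def knn_predict(train, point, k):
    """Majority label among the k training rows closest to `point`."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], point)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cv_accuracy(rows, k, n_folds=5):
    """Mean accuracy of k-NN over n_folds cross-validation folds."""
    scores = []
    for fold in range(n_folds):
        # Every n_folds-th row is held out; the rest is used for training.
        test = rows[fold::n_folds]
        train = [r for i, r in enumerate(rows) if i % n_folds != fold]
        hits = sum(knn_predict(train, x, k) == y for x, y in test)
        scores.append(hits / len(test))
    return sum(scores) / n_folds


def grid_search(rows, k_grid, n_folds=5):
    """Return (best_k, best_score) over the candidate values in k_grid."""
    results = {k: cv_accuracy(rows, k, n_folds) for k in k_grid}
    best_k = max(results, key=results.get)
    return best_k, results[best_k]


# Synthetic, made-up data: (features, label) in two well-separated clusters.
rng = random.Random(0)
rows = [((rng.gauss(0, 1), rng.gauss(0, 1)), "human") for _ in range(50)]
rows += [((rng.gauss(6, 1), rng.gauss(6, 1)), "bot") for _ in range(50)]
rng.shuffle(rows)

best_k, best_score = grid_search(rows, k_grid=[1, 3, 5])
```

In the study the same idea is applied across classifier families and their parameters, with the random forest obtaining the best score.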
8. Evaluation of the classifier
Classification model for detecting bots, based on issue and PR comments
Precision:  0.94   0.99   0.98
Recall:     0.91   0.99   0.98
F1-score:   0.92   0.99   0.98
We identified 4 categories of misclassified humans
We identified 3 categories of misclassified bots
M. Golzadeh, A. Decan, D. Legay, and T. Mens, “A ground-truth dataset and classification model for detecting bots in GitHub issue and PR comments,” Journal of Systems and Software [Submitted for review], 2020.
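For reference, precision, recall and F1-score follow from the counts of true positives, false positives and false negatives for one class. The sketch below uses the standard definitions. The function name, the label strings and the sample data are illustrative assumptions, not taken from the study.

```python
def precision_recall_f1(true_labels, predicted_labels, positive="bot"):
    """Standard precision, recall and F1 for the `positive` class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up labels for illustration only.
truth = ["bot", "bot", "bot", "human", "human"]
predicted = ["bot", "bot", "human", "bot", "human"]
precision, recall, f1 = precision_recall_f1(truth, predicted)
```

Calling the function with `positive="human"` gives the scores for the other class.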
10. Detecting and Characterizing Bots that Commit Code [2]
Ground-truth dataset of 13,150 bots and 13,150 humans
Components of the BIMAN approach:
• BIN: the presence of the string “bot” at the end of the author name
  Precision: 0.99, Recall: 0.37
• BIM: commit messages
  AUC-ROC: 0.70, Precision: 0.57, Recall: 0.67
• BICA: files changed by each commit, the projects that the commit is associated with, timestamp and time zone of the commits
  AUC-ROC: 0.89
BIMAN is validated on a dataset of 67 bots and 67 humans
58 cases were correctly detected (87%), AUC-ROC: 0.89
2- T. Dey, S. Mousavi, E. Ponce, T. Fry, B. Vasilescu, A. Filipova, and A. Mockus, “Detecting and characterizing bots that commit code,” in Int’l Conf. Mining Software Repositories, 2020.
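The BIN rule as worded on the slide (the string “bot” at the end of the author name) can be sketched as a suffix check. This is an illustration of the slide's wording, not the exact expression used in [2]. The case-insensitive match and the optional closing bracket, which covers names such as "dependabot[bot]", are assumptions, and `looks_like_bot_name` is a hypothetical name.

```python
import re

# Assumption: "at the end" is read as a case-insensitive suffix match,
# optionally followed by a closing bracket as in "dependabot[bot]".
_BOT_SUFFIX = re.compile(r"bot\]?$", re.IGNORECASE)


def looks_like_bot_name(author_name):
    """Return True if the author name ends with the string "bot"."""
    return bool(_BOT_SUFFIX.search(author_name.strip()))
```

A name-only rule of this kind misses every bot that does not advertise itself in its name, which is consistent with the low recall reported for BIN, and it can also flag human names such as "Abbot".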
11. Evaluation of the models
Ground-truth dataset
Activities of an account in a single repository
Only consider contributors that have at least 10 commit messages
3,380 + 3,542 contributors (6,922 in total)
Compute features for each contributor
Model trained on PR and issue comments
Precision:  0.76   0.78   0.77
Recall:     0.78   0.76   0.77
F1-score:   0.77   0.77   0.77
Model trained on commit messages
Precision:  0.82   0.78   0.80
Recall:     0.75   0.84   0.80
F1-score:   0.78   0.81   0.80
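The selection step on this slide (keep only contributors with at least 10 commit messages, then compute features per contributor) can be sketched as a grouping followed by a filter. The input shape, a list of `(author, message)` pairs, and the name `messages_by_contributor` are assumptions made for illustration.

```python
from collections import defaultdict

MIN_COMMITS = 10  # threshold stated on the slide


def messages_by_contributor(commits, min_commits=MIN_COMMITS):
    """Group commit messages per author and drop authors below the threshold."""
    grouped = defaultdict(list)
    for author, message in commits:
        grouped[author].append(message)
    return {a: msgs for a, msgs in grouped.items() if len(msgs) >= min_commits}


# Made-up commits: "alice" has 10 messages, "bob" only 9.
commits = [("alice", f"fix issue {i}") for i in range(10)]
commits += [("bob", "update readme")] * 9
selected = messages_by_contributor(commits)
```

Each remaining list of messages is then the input from which the per-contributor features are computed.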
12. A bot detector based on git commits
BoDeGiC: command-line tool in Python
Given a git repository, predicts the type of authors/committers; export to CSV or JSON
https://github.com/mehdigolzadeh/BoDeGiC
[Diagram: Inputs (git repository, authors list, parameters) → BoDeGiC (extract git commit messages, extracting features, pre-trained classifier) → Output (bot prediction)]
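The first and last stages of such a pipeline, extracting commit messages from a git repository and exporting predictions to CSV or JSON, might look like the sketch below. It is not BoDeGiC's code or command-line interface. All function names are hypothetical, the feature extraction and pre-trained classifier are left out, and the separator bytes are an arbitrary choice for this example.

```python
import csv
import io
import json
import subprocess

FIELD_SEP, RECORD_SEP = "\x1f", "\x1e"
# %an = author name, %B = raw commit message; %x1f / %x1e emit the separators.
LOG_FORMAT = "%an%x1f%B%x1e"


def read_git_log(repo_path):
    """Return raw `git log` output for repo_path (requires git on PATH)."""
    return subprocess.run(
        ["git", "-C", repo_path, "log", f"--pretty=format:{LOG_FORMAT}"],
        capture_output=True, text=True, check=True,
    ).stdout


def parse_git_log(raw):
    """Split raw log output into (author, message) pairs."""
    commits = []
    for record in raw.split(RECORD_SEP):
        if record.strip():
            author, message = record.strip("\n").split(FIELD_SEP, 1)
            commits.append((author, message.strip()))
    return commits


def export_predictions(predictions, fmt="csv"):
    """Serialise {author: label} as CSV or JSON text."""
    if fmt == "json":
        return json.dumps(predictions, indent=2)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["author", "prediction"])
    writer.writerows(sorted(predictions.items()))
    return buffer.getvalue()
```

A typical use would chain `parse_git_log(read_git_log("path/to/repo"))`, pass the grouped messages through a classifier, and hand the resulting labels to `export_predictions`.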
13. Discussions and Threats
A better definition of “what a bot is” is required
Identify bots not at the level of a contributor but at the level of activities
The ground-truth dataset: a source of threat to construct validity
19 out of 100 cases wrongly labeled in the original dataset (13 bots, 6 humans)
Presence of mixed accounts
14. Sum up
• Detecting bots is important to avoid bias in socio-technical studies
• An approach and a model to identify bots
• How well this approach performs on git commit messages
• Evaluation of the existing model
• Train a new classifier on commit messages and evaluate the approach
• A tool to identify bots in git repositories
• Substitute BIM by our classification model in the BIMAN approach [2]