Evaluating Bot Detection Models on Code Commits

Evaluating a bot detection model on git commit messages
Mehdi Golzadeh, Alexandre Decan, Tom Mens
University of Mons - Belgium
BENEVOL 2020
THE 19TH BELGIUM-NETHERLANDS-LUXEMBURG SOFTWARE EVOLUTION WORKSHOP

Socio-technical background

Detecting bots in distributed software development activities is very important:
identity merging, team productivity, development effort, developer onboarding or abandonment, …
Pull requests are more than twice as likely to get accepted if bots are present in their commenting activity

Idea behind the study
Mean distance between comments
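One possible reading of "mean distance between comments" is the average pairwise text distance over all comments of one account. The sketch below assumes Jaccard distance on lowercase word sets; the slides do not name the distance metric, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def jaccard_distance(a: str, b: str) -> float:
    """Jaccard distance between the lowercase word sets of two comments."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0  # two empty comments are treated as identical
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


def mean_comment_distance(comments: Sequence[str]) -> float:
    """Average distance over all unordered pairs of comments (0.0 if fewer than two)."""
    pairs = list(combinations(comments, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)
```

Under this assumption, an account that posts near-identical templated comments gets a low mean distance, while an account whose comments share no words gets a mean distance of 1.0.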

Idea behind the study
[Figure: comment patterns]

Bot identification
Available ground-truth datasets
Current methods and tools
Download comments from GitHub
136,529 repositories
10,874,611 issue and PR comments
[Flowchart: rating process]
Steps: Select an account, Rate (by 1st rater), Rate (by 2nd rater), Discussion, Rate (3rd rater)
Branch labels: [agreement], [disagreement], [difficult case]
Outcomes: [include account], [exclude account]

Classification of accounts
An approach based on characteristics of comments
Ground-truth dataset of 527 bots and 4,473 humans
Features
• Number of comments
• Number of empty comments
• Comment patterns
• Inequality between comments in patterns
Classifiers, compared using grid-search cross-validation
• Random forest classifier (best score)
• k-nearest neighbors
• Support vector machine
• Decision tree classifier
• Logistic regression
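The features named on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the comments have already been grouped into patterns (only the pattern sizes are passed in), and it assumes the Gini coefficient as the inequality measure, which the slide does not state. All function names are hypothetical.

```python
from typing import Dict, Sequence


def count_empty_comments(comments: Sequence[str]) -> int:
    """Number of comments that contain only whitespace."""
    return sum(1 for comment in comments if not comment.strip())


def gini(pattern_sizes: Sequence[int]) -> float:
    """Gini coefficient of the comment counts per pattern (0 = perfectly even)."""
    sizes = sorted(pattern_sizes)
    n, total = len(sizes), sum(sizes)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * size for rank, size in enumerate(sizes, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


def account_features(comments: Sequence[str], pattern_sizes: Sequence[int]) -> Dict[str, float]:
    """Assemble the four features listed on the slide for one account."""
    return {
        "comments": len(comments),
        "empty_comments": count_empty_comments(comments),
        "patterns": len(pattern_sizes),
        "inequality": gini(pattern_sizes),
    }
```

An account whose comments are spread evenly over its patterns scores 0 on the inequality feature; an account that concentrates almost all comments in one pattern scores close to the maximum.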

Evaluation of the classifier
Classification model for detecting bots
Based on issue and PR comments
Precision: 0.94 Precision: 0.99 Precision: 0.98
Recall: 0.91 Recall: 0.99 Recall: 0.98
F1-score: 0.92 F1-score: 0.99 F1-score: 0.98
We identified 4 categories of misclassified humans
We identified 3 categories of misclassified bots
M. Golzadeh, A. Decan, D. Legay, and T. Mens, “A ground-truth dataset and classification model for detecting bots in GitHub issue and PR comments,” Journal of Systems and Software [Submitted for review], 2020.
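For reference, the three reported metrics are related as follows. The sketch uses the standard definitions of precision, recall and F1-score; the helper names are hypothetical and any confusion-matrix counts passed in are illustrative, not taken from the study.

```python
from typing import Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)
```

Applied to the first precision/recall pair on the slide, `round(f1_score(0.94, 0.91), 2)` gives 0.92, which agrees with the reported F1-score.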

Detecting and Characterizing Bots that Commit Code [2]
Ground-truth dataset of 13,150 bots and 13,150 humans
BIMAN combines three methods:
• BIN: the presence of the string “bot” at the end of the author name (Precision: 0.99, Recall: 0.37)
• BIM: commit messages (AUC-ROC: 0.70, Precision: 0.57, Recall: 0.67)
• BICA: files changed by each commit, the projects that commit is associated with, timestamp and time zone of the commits (AUC-ROC: 0.89)
BIMAN is validated on a dataset of 67 bots and 67 humans
58 cases were correctly detected (87%) AUC-ROC: 0.89
2- T. Dey, S. Mousavi, E. Ponce, T. Fry, B. Vasilescu, A. Filipova, and A. Mockus, “Detecting and characterizing bots that commit code,” in Int’l Conf. Mining Software Repositories, 2020.
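The name-based heuristic described for BIN can be illustrated with a short check. This is a simplified sketch and not the pattern used in [2]: it assumes a case-insensitive match and tolerates trailing punctuation, so that names such as "dependabot[bot]" also match. The function name is hypothetical.

```python
import re

# Assumption: "bot" at the end of the name, optionally followed by
# non-alphanumeric characters such as "]" or ")".
BOT_SUFFIX = re.compile(r"bot[^a-z0-9]*$")


def looks_like_bot_name(author_name: str) -> bool:
    """Return True if the author name ends with the string "bot"."""
    return BOT_SUFFIX.search(author_name.strip().lower()) is not None
```

A rule like this rarely flags humans, although surnames such as "Talbot" still match, and it misses every bot whose name does not end in "bot" (for example "github-actions"), which fits the high precision and low recall reported for BIN.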

Evaluation of the models
Ground-truth dataset
Activities of an account in a single repository
Only consider contributors that have at least 10 commit messages
3,380 3,542
Compute features for each contributor
Model trained on PR and issue comments
F1-score: 0.77 F1-score: 0.77 F1-score: 0.77
Model trained on commit messages
F1-score: 0.78 F1-score: 0.81 F1-score: 0.80
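The selection step on this slide (activities of an account in a single repository, at least 10 commit messages) could be sketched as below. The input shape, a sequence of (repository, contributor, message) tuples, and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

MIN_COMMITS = 10  # threshold stated on the slide

Key = Tuple[str, str]  # (repository, contributor)


def group_messages(commits: Iterable[Tuple[str, str, str]]) -> Dict[Key, List[str]]:
    """Group commit messages per contributor within a single repository."""
    grouped: Dict[Key, List[str]] = defaultdict(list)
    for repository, contributor, message in commits:
        grouped[(repository, contributor)].append(message)
    return dict(grouped)


def eligible_contributors(
    commits: Iterable[Tuple[str, str, str]], min_commits: int = MIN_COMMITS
) -> Dict[Key, List[str]]:
    """Keep only (repository, contributor) pairs with enough commit messages."""
    return {
        key: messages
        for key, messages in group_messages(commits).items()
        if len(messages) >= min_commits
    }
```

Because the key includes the repository, the same account is counted separately in each repository it contributes to; features would then be computed on each retained list of messages.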

A bot detector based on git commits
BoDeGiC
Command-line tool in Python.
Given a git repository, predicts the type of authors/committers; exports to CSV or JSON
https://github.com/mehdigolzadeh/BoDeGiC
Inputs: Git repository, Authors list, Parameters
BoDeGiC: Extract git commit messages → Extracting features → Pre-trained classifier
Output: Bot prediction
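The first stage of the pipeline, extracting commit messages per author, might look like the sketch below. It is not BoDeGiC's code and does not use its command-line options, which the slides do not show. It calls plain `git log` with a custom `--pretty=format:` string; the separators and function names are choices made for this example.

```python
import subprocess
from collections import defaultdict
from typing import Dict, List

FIELD_SEP, RECORD_SEP = "\x1f", "\x1e"
# %an = author name, %s = commit subject, %x1f / %x1e = separator bytes
LOG_FORMAT = "%an%x1f%s%x1e"


def read_git_log(repo_path: str) -> str:
    """Run `git log` in the given repository and return its raw output."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=format:" + LOG_FORMAT],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def messages_by_author(raw_log: str) -> Dict[str, List[str]]:
    """Parse the raw log into a mapping from author name to commit subjects."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for record in raw_log.split(RECORD_SEP):
        record = record.strip("\n")
        if not record:
            continue
        author, _, subject = record.partition(FIELD_SEP)
        grouped[author].append(subject)
    return dict(grouped)
```

Calling `messages_by_author(read_git_log(path))` on a local clone yields one list of commit subjects per author, which is the kind of input a feature-extraction step would consume.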

Discussions and Threats
• A better definition of “what a bot is” is required
• Identify bots not at the level of a contributor but at the level of activities
• The ground-truth dataset: a source of threat to construct validity
• 19 out of 100 cases wrongly labeled in the original dataset (13 bots, 6 humans)
• Presence of mixed accounts

Sum up
• Detecting bots is important to avoid bias in socio-technical studies
• An approach and a model to identify bots
• To what extent does this approach perform on git commit messages?
• Evaluation of the existing model
• Train a new classifier on commit messages and evaluate the approach
• A tool to identify bots in git repositories
• Substitute BIM by our classification model in the BIMAN approach [2]
2- T. Dey, S. Mousavi, E. Ponce, T. Fry, B. Vasilescu, A. Filipova, and A. Mockus, “Detecting and characterizing bots that commit code,” in Int’l Conf. Mining Software Repositories, 2020.

The FNRS-FWO Excellence of Science research project SECO-ASSIST
Thank you
