Distributed version control systems (DVCs) are replacing centralized version control systems (CVCs). Our research aims to uncover how DVCs are being used and how their use differs from that of CVCs. For the last 23 months we have been mining Linux git repositories to understand how Linux developers use their DVC (git). I will describe the challenges of this mining and provide an overview of how Linux uses DVCs.
10. Super-repository
● Collection of repositories cloned (recursively) from the same repo
  – At least one per developer
    ● On their personal computer
  – At least one public repository
    ● The blessed
  – In git, no way to trace them
11. Moving commits across the superRepo
Method | Description
Push | Done at source, needs write access to destination
Pull | Done at destination, needs read access to source
Email | Source creates patch, recipient applies it
12. Merging in DVCs
● If not all commits are in the destination
  – Create temp branch
  – Copy commits
● Merge locally
● If created,
  – delete temp branch
22. ContinuousMining of Linux
● Linux has no centralized logging
  – Nobody really knows what the superRepo is
  – Commits flow without any event broadcasting mechanism
● Where do we find the activity?
  – Repos
  – Commits
23. Repos and Committers
● Most repos will have a known set of people committing to them
  – Simplest case: its owner is the only committer
  – Extreme case: the repo is used as a centralized version control system, and everybody commits to it
24. Semiautomatic Process
● Every 3 hrs, ask every repo:
  – What new commits do you have?
  – What commits did you delete?
  – Automatically resolve propagations
    ● Commits might propagate before we scan
● Daily:
  – Are commits in repo by unknown committers?
    ● Answer:
      – Is there a new repo? Or is the committer new to the repo?
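The two questions asked at every scan can be sketched as a set difference between consecutive scans of one repo. This is only an illustration under stated assumptions: the function names, the in-memory sets of commit ids, and the committer list are hypothetical and are not taken from the tool described in the talk.

```python
from datetime import datetime, timezone


def diff_scans(previous, current, repo, when):
    """Compare two scans of one repo and emit propagation events.

    previous, current: sets of commit ids seen in the repo at each scan
    (hypothetical representation). Returns a list of
    (commit_id, action, repo, when) tuples, additions first.
    """
    events = []
    # "What new commits do you have?"
    for commit_id in sorted(current - previous):
        events.append((commit_id, "added", repo, when))
    # "What commits did you delete?"
    for commit_id in sorted(previous - current):
        events.append((commit_id, "deleted", repo, when))
    return events


def unknown_committers(commits, known):
    """Daily check: committers seen in a repo that are not on its known list.

    commits: iterable of (commit_id, committer_email) pairs.
    known: set of committer emails already associated with the repo.
    """
    return sorted({email for _, email in commits} - known)
```

A non-empty result from `unknown_committers` is what would trigger the manual question on the slide: is this a new repo, or a committer who is new to the repo?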
25. Implementation
● Running since Nov. 2011
  – Currently scans 650 repos every 3 hrs
  – Retrieved
    ● 2.3 million commits (compared to 400k in Linus's repo)
    ● 109 million records in propagation table
      <commit-id, added|deleted, repo, when>
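The record shape <commit-id, added|deleted, repo, when> is enough to answer simple questions about how commits move. The sketch below is an assumption about how such records could be queried in memory; the `Propagation` type and both helper functions are hypothetical names, and `when` is any comparable timestamp.

```python
from collections import namedtuple

# One row of the propagation table, mirroring <commit-id, added|deleted, repo, when>.
Propagation = namedtuple("Propagation", "commit_id action repo when")


def first_seen(records):
    """Map each commit id to the (when, repo) of its earliest 'added' record."""
    earliest = {}
    for rec in records:
        if rec.action != "added":
            continue
        key = (rec.when, rec.repo)
        if rec.commit_id not in earliest or key < earliest[rec.commit_id]:
            earliest[rec.commit_id] = key
    return earliest


def repos_holding(records):
    """Replay records in time order; return commit id -> repos that still hold it."""
    holding = {}
    for rec in sorted(records, key=lambda r: r.when):
        repos = holding.setdefault(rec.commit_id, set())
        if rec.action == "added":
            repos.add(rec.repo)
        else:
            repos.discard(rec.repo)
    return holding
```

Because scans are periodic, `first_seen` reports the first scan at which a commit was observed, not the moment it was actually created or pushed.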
27. Is one better than the other?
● RQ1
  – Does continuousMining uncover a larger development ecosystem than snapMining?
● RQ2
  – Does continuousMining expose any missing information, or bias in the recorded history of the project recovered using snapMining?
28. Snapshot (Linus) vs. Continuous
Metric | Snapshot (Linus) | Continuous
No. of repos | 1 | 479
Commits | 64k | 533k
Non-merge commits | 59k | 485k
Unique non-merges | 58k | 135k
% unique non-merges | 98.9% | 27.9%
Non-merges that reached blessed |  | 43.1%
Different author emails | 3434 | 5646
Different authors | 2883 | 4575
Different committer emails | 283 | 1185
Different committers | 245 | 1058
33. Arrival of Commits at Blessed...
● We can classify patches as a new feature or bug-fix
34. So what? (the reviewer will ask)
● What can we do with this data?
  – For researchers: enable empirical studies of activities previously invisible
  – For practitioners: implement traceability of
    ● Commits and
    ● Repos
37. The Repos
● X: activity (in commits)
● Y: ratio of commits accepted by Linus to total commits
● Shape:
  – Triangles: official repos
  – Circles: non-official repos
● Size:
  – Smaller: consume commits
  – Larger: produce commits
● Color: merge/commit ratio
  – Grey: never merge
  – “Cooler”: high ratio
  – “Warmer”: lower ratio
42. Linux Dashboard
● We asked two Linux maintainers:
  – Can this info be useful?
● Answer:
  – “Yes”
  … but not for what we expected...
43. Tracking commits in Linux
● Need to track patches, not commits
  – Particularly important in consumer repositories
  – Need to cross-reference commits
    ● What commits contain the same patch?
    ● What commits are mentioned in the log?
  – Some repos track commits from blessed via cherry-picking
    ● Commit ids are useless
    ● So they annotate log with the origin commit id
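One way to answer "what commits are mentioned in the log?" is to scan log messages for the annotation that `git cherry-pick -x` appends, "(cherry picked from commit <id>)". The sketch below assumes that annotation format only; repos that use other conventions would need additional patterns, and the function names and the dictionary of log messages are hypothetical.

```python
import re

# Annotation appended by `git cherry-pick -x`; other conventions are not covered.
CHERRY_PICK_RE = re.compile(r"\(cherry picked from commit ([0-9a-f]{7,40})\)")


def origin_commits(log_message):
    """Return the origin commit ids cited in one log message, in order."""
    return CHERRY_PICK_RE.findall(log_message)


def cross_reference(commits):
    """Build an index: origin commit id -> commit ids whose log cites it.

    commits: dict mapping commit id -> full log message (hypothetical input).
    """
    index = {}
    for commit_id, message in sorted(commits.items()):
        for origin in origin_commits(message):
            index.setdefault(origin, []).append(commit_id)
    return index
```

This covers only the log-based cross-reference. Finding commits that contain the same patch needs a content-based key instead, for which `git patch-id` is one existing option.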
45. Has it reached linux-next before blessed?
● Commits should pass through linux-next before arriving at blessed.
● If not, potential issue
● Hard to do with current tools:
  ● Patches change commit id
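Given arrival records keyed by patch rather than by commit id, the linux-next check reduces to comparing first-arrival times. The sketch below is an assumption, not the tool from the talk: `patch_key` is a hypothetical stable identifier for a patch (for example one derived from the diff content), and the repo labels "linux-next" and "blessed" are illustrative defaults.

```python
def bypassed_linux_next(arrivals, staging="linux-next", blessed="blessed"):
    """Return patch keys that reached blessed without first appearing in staging.

    arrivals: iterable of (patch_key, repo, when) tuples, where patch_key is a
    hypothetical id that stays stable even when the commit id changes.
    """
    first = {}
    for patch_key, repo, when in arrivals:
        slot = (patch_key, repo)
        if slot not in first or when < first[slot]:
            first[slot] = when

    flagged = []
    for (patch_key, repo), when in first.items():
        if repo != blessed:
            continue
        staged = first.get((patch_key, staging))
        # Flag patches never seen in staging, or seen there only after blessed.
        # Equal timestamps are treated as passing, since scans are coarse.
        if staged is None or staged > when:
            flagged.append(patch_key)
    return sorted(flagged)
```

Keying on the patch is the design choice that matters here: a check keyed on commit ids would flag every patch whose id changed on the way to blessed.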