Model-based Analysis of Large Scale Software Repositories
1. Dr. Markus Scheidgen: Model-based Analysis of Large Scale Software Repositories
■ problem
■ creating models of software repositories
■ the means for analyzing such models
■ example analysis
3. Is Software Engineering a Science?
■ Def.: Science (from Latin scientia) is a systematic enterprise that
builds and organizes knowledge in the form of testable
explanations and predictions about the universe.
■ Testable? Example theses:
★ DSLs allow domain experts to develop software more effectively and more
efficiently than with GPLs.
★ Static type systems lead to safer programming and fewer bugs.
★ Functional programming leads to less performant programs.
★ Scrum allows programs to be developed faster.
★ My framework allows developing ... more, faster ... with less, fewer ...
■ Methods for quantitative measures of software properties
(metrics) are mostly used to assess the state of software projects,
and rarely for empirical studies on software engineering itself
4. Reasons
■ inaccessibility
• new methods have to be used first to produce data
• industry cooperations are necessary
• open-source repositories are a possibility
■ data quality
• it is not easy to distinguish between written code, generated code, and test code
• there are maintained projects, developed projects, and aborted projects
■ heterogeneity
• different project structures
• different paradigms
• different languages
• different APIs
■ amounts of data
• SourceForge hosts >350,000 projects
• the current snapshot of the Linux kernel contains 10^8 AST nodes
• EMF's 50 MB Git repository takes 20 GB of binary-encoded AST data
5. Relevant Fields with Partial Solutions
■ Mining Software Repositories (MSR): analysis of the rich data contained in software engineering related repositories such as version control systems, mailing lists, and bug-tracking systems
• guiding software development
• defect detection, prediction, resolution
• gaining actionable knowledge about software projects and software engineering methodologies
• static, language independent
• syntax based
• scale: single projects, large scale (Eclipse, Apache), ultra large scale (SourceForge, GitHub)
■ Software Metrics: definition, acquisition, and analysis of quantitative measures of certain software properties
• assessment of engineering costs for development, change, maintenance, etc.
• comparative analysis of software systems or analysis of software evolution
• comparative analysis of software engineering methodologies
• language independent (e.g. LOC)
• syntax based (e.g. McCabe)
• static, dynamic (evolution)
■ Reverse Engineering: analyzing existing code bases to create representations at a higher level of abstraction (models)
• understanding existing software for development, change, maintenance, etc.
• deriving AST, UML, or KDM models from software
• syntax (structure, behavior)
• semantics
6. Problem Statement: Everything is there, but ...
1. Missing abstractions:
■ no general abstractions that cover multiple languages/repositories are used
■ only proprietary solutions and systems tailored for specific
algorithms/databases, languages, repositories
2. Scalability is an issue:
■ for ultra large scale repositories, only VCS meta-data is used
■ for large scale repositories, only language independent analysis
at file-based granularity is possible
■ language dependent analysis at AST-level detail is feasible only
for single software projects
7. Proposed Solution: Scalable Model-based Framework
■ Meta-model and reverse engineering based approach to
analyze code models on different, well-defined levels of
abstraction instead of the code itself.
■ Query and transformation languages as well as model
persistence based on the Map/Reduce BigData paradigm.
■ Target: AST-level analysis of large-scale repositories, e.g.
git.eclipse.org (>300 projects)
14. Model-based Analysis of Large Scale Software Repositories
[Figure: pipeline from VCS to Model to Metrics to a better understanding of software engineering]
■ 1) Reverse engineering to create AST-level models of software and its evolution (VCS to Model)
■ 2) Transformations based on MSR algorithms to derive implicit dependencies
■ 2) Queries to perform measurements based on structural, causal, and implicit dependencies (Model to Metrics)
■ 3) Statistical analysis: better understanding of software engineering
15. 1) Reverse Engineering Software in Version Control Systems (VCS)
[Figure: code in a VCS (organized by revisions and files) mapped to a software model (causal relations and structural relations)]
16. 1) Models of Source Repositories (github.com/markus1978/srcrepo)
[Figure: SrcRepo component stack: SrcRepo, EMF Compare, JGit, and MoDisco, with EMF/EMF-Fragments; the input is a git repository with Java sources]
17. 1) Models of Source Repositories (github.com/markus1978/srcrepo)
[Figure: example repository model and its metamodel. The model shows revisions B1.R1 to B1.R4 and B2.R1 to B2.R2 with the files (A, B, C, D) they touch. The metamodel has a JGit part (Repository, Revision with prev/next, Diff) and a MoDisco part (Compilation Unit, Model, Package, Class, ..., with references such as usageIn, Package Access, package, extends); references are marked «relation», «fragmentation», or both]
18. 1) Models of Source Repositories: Scalability
■ SrcRepo is based on EMF-Fragments
(https://github.com/markus1978/emf-fragments)
[Figure: storage and processing stack]
■ model transformations: map/reduce (Hadoop)
■ structured data (EMF model): key-value store holding EMF resources (HBase)
■ DFS (HDFS)
■ "share nothing" node cluster
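The map/reduce idea behind this stack can be sketched for a simple metric. This is a minimal sketch in plain Scala collections, not the Hadoop or EMF-Fragments API: `Partial`, `mapFragment`, and `reduce` are hypothetical names, and a "fragment" is assumed to be reducible to the method counts of the classes it contains. Each fragment yields a partial result on its own (map), and partial results are merged with an associative operation (reduce), so no node needs the whole model.

```scala
object MapReduceSketch {
  // Partial result for one fragment: number of classes and number of methods.
  final case class Partial(classes: Long, methods: Long) {
    def +(other: Partial): Partial =
      Partial(classes + other.classes, methods + other.methods)
    def average: Double =
      if (classes == 0) 0.0 else methods.toDouble / classes
  }

  // Map step (assumption: a fragment is given as the method counts of its classes).
  def mapFragment(methodCounts: Seq[Int]): Partial =
    Partial(methodCounts.size.toLong, methodCounts.map(_.toLong).sum)

  // Reduce step: + is associative and commutative, so partial results
  // can be merged in any order and on any node.
  def reduce(partials: Seq[Partial]): Partial =
    partials.foldLeft(Partial(0, 0))(_ + _)
}
```

Keeping the sum and the count separate until the end is the design choice that matters here: an average of per-fragment averages would weight small fragments too heavily.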
19. 2) Scala for queries and transformations: Syntax (internal DSL: from OCL to Scala)
Filip Krikava: Enriching EMF Models with Scala (quick overview), Eclipse Summit, Oct 24 2012
21. 2) Scala for Queries: Syntax
■ example SrcRepo query: "average number of methods per class"
    def avgMethodsPerClass(self: Model) = {
      // all packages, including nested packages
      val packages = self.getOwnedPackages().
        closure((p) => p.getOwnedPackages())
      // all classes in those packages, including inner classes
      val classes = packages.collect((p) => p.getOwnedClasses()).
        closure((c) => c.getInnerClasses())
      classes.average((c) => c.getOwnedMethods().size())
    }
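The query above relies on the internal DSL's `closure`, `collect`, and `average` operations on EMF collections. The same logic can be tried without EMF in a self-contained sketch: the case classes `Clazz`, `Pkg`, and `Model` below are hypothetical stand-ins for the MoDisco model classes, and `closure` is a hand-written helper, not the DSL's implementation.

```scala
// Hypothetical stand-ins for the MoDisco/EMF model classes (not the real API).
final case class Clazz(name: String, methods: List[String], innerClasses: List[Clazz] = Nil)
final case class Pkg(name: String, classes: List[Clazz], subPackages: List[Pkg] = Nil)
final case class Model(packages: List[Pkg])

object AvgMethodsSketch {
  // Transitive closure over a tree: the roots plus everything reachable via `children`.
  def closure[A](roots: List[A])(children: A => List[A]): List[A] =
    roots.flatMap(r => r :: closure(children(r))(children))

  def avgMethodsPerClass(model: Model): Double = {
    // all packages, including nested packages
    val packages = closure(model.packages)(_.subPackages)
    // all classes in those packages, including inner classes
    val classes = closure(packages.flatMap(_.classes))(_.innerClasses)
    if (classes.isEmpty) 0.0
    else classes.map(_.methods.size).sum.toDouble / classes.size
  }
}
```

The helper assumes containment is a tree, as it is for packages and inner classes; a graph with cycles would need a visited set.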
24. First Example Case Study: Design Structure Matrices (DSM) and Propagation Cost
Alan MacCormack, John Rusnak, Carliss Baldwin: Exploring the Structure of Complex Software
Designs, Management Science (INFORMS), 2006
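A minimal sketch of the propagation cost metric, assuming it is defined as the density of the visibility matrix, i.e. the transitive closure of the DSM with every element counted as visible to itself. The names `visibility` and `propagationCost` and the Boolean matrix encoding (`dsm(i)(j)` is true when element i depends on element j) are choices made for this sketch.

```scala
object PropagationCostSketch {
  type Matrix = Vector[Vector[Boolean]]

  // Warshall's algorithm; the diagonal is set so every element sees itself.
  def visibility(dsm: Matrix): Matrix = {
    val n = dsm.length
    val v = Array.tabulate(n, n)((i, j) => i == j || dsm(i)(j))
    for (k <- 0 until n; i <- 0 until n; j <- 0 until n)
      if (v(i)(k) && v(k)(j)) v(i)(j) = true
    v.map(_.toVector).toVector
  }

  // Fraction of cells in the visibility matrix that are set.
  def propagationCost(dsm: Matrix): Double = {
    val n = dsm.length
    if (n == 0) 0.0
    else visibility(dsm).map(_.count(b => b)).sum.toDouble / (n * n)
  }
}
```

For a chain in which A depends on B and B depends on C, six of the nine cells are set (A sees A, B, C; B sees B, C; C sees C), so the sketch yields 6/9.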
26. Second Example Case Study: Detecting Cross-Cutting Concerns
Silvia Breu, Thomas Zimmermann, Christian Lindig: Mining Eclipse for Cross-Cutting Concerns,
MSR '06, Shanghai, 2006
■ The same set of methods called from different locations
within the same transaction (commits in a small time window
by the same committer) indicates the introduction
of a cross-cutting concern.
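The heuristic on this slide can be sketched in two steps: group commits into transactions, then look for identical sets of newly called methods at different locations. This is a simplified sketch, not the algorithm from the cited paper: `AddedCall`, `Commit`, the sliding-window rule, and the exact-set comparison are assumptions made for illustration.

```scala
// Hypothetical input model: each commit lists the method calls it adds.
final case class AddedCall(location: String, callee: String)
final case class Commit(committer: String, time: Long, addedCalls: List[AddedCall])

object CrossCuttingSketch {
  // A commit joins the current transaction if it follows the previous commit
  // of the same committer by at most `windowSeconds` (sliding window).
  def transactions(commits: List[Commit], windowSeconds: Long): List[List[Commit]] =
    commits.groupBy(_.committer).values.toList.flatMap { own =>
      own.sortBy(_.time).foldLeft(List.empty[List[Commit]]) {
        case (current :: done, c) if c.time - current.head.time <= windowSeconds =>
          (c :: current) :: done
        case (acc, c) => List(c) :: acc
      }
    }

  // Candidates: callee sets that were added at two or more locations
  // within one transaction, mapped to those locations.
  def candidates(transaction: List[Commit]): Map[Set[String], Set[String]] = {
    val calleesByLocation: Map[String, Set[String]] =
      transaction.flatMap(_.addedCalls).groupBy(_.location)
        .map { case (loc, calls) => loc -> calls.map(_.callee).toSet }
    calleesByLocation.toList
      .groupBy { case (_, callees) => callees }
      .map { case (callees, locs) => callees -> locs.map(_._1).toSet }
      .filter { case (_, locs) => locs.size >= 2 }
  }
}
```

The order of the returned transactions is unspecified because commits are first grouped by committer; callers that need chronological order would have to sort the result.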
30. Summary
[Figure: pipeline from VCS to Model to Metrics to a better understanding of software engineering]
■ 1) Reverse engineering to create AST-level models of software and its evolution (VCS to Model)
■ 2) Transformations based on MSR algorithms to derive implicit dependencies
■ 2) Queries to perform measurements based on structural, causal, and implicit dependencies (Model to Metrics)
■ Statistical analysis: better understanding of software engineering