Dr. Markus Scheidgen
Model-based Analysis of
Large Scale Software
Repositories
■ problem
■ creating models of software repositories
■ the means for analyzing such models
■ example analysis
Problem
Is Software Engineering a Science?
■ Def.: Science (from Latin scientia) is a systematic enterprise that
builds and organizes knowledge in the form of testable
explanations and predictions about the universe.
■ Testable? Example theses:
★ DSLs allow domain experts to develop software effectively and more
efficiently than with GPLs.
★ Static type systems lead to safer programming and fewer bugs.
★ Functional programming leads to less performant programs.
★ Scrum allows programs to be developed faster.
★ My framework allows developing ... more, faster ... with less, fewer ...
■ Methods for quantitative measures of software properties
(metrics) are mostly used to assess the state of software projects,
and rarely for empirical studies on software engineering itself.
Reasons
inaccessibility
• new methods have to be used first to produce data
• industry cooperations necessary
• open-source repositories are a possibility
data quality
• not easy to distinguish between written code, generated code, test code
• there are maintained projects, developed projects, aborted projects
heterogeneity
• different project structures
• different paradigms
• different languages
• different APIs
amounts of data
• SourceForge hosts >350,000 projects
• current snapshot of the Linux kernel contains 10^8 AST nodes
• EMF's 50 MB Git repository takes 20 GB of binary-encoded AST data
Relevant Fields with Partial Solutions
Mining Software Repositories (MSR)
analyzing rich data contained in software engineering related repositories such as version control systems, mailing lists, bug-tracking systems
• guiding software development
• defect detection, prediction, resolution
• gaining actionable knowledge about software projects and software engineering methodologies
• static, language independent
• syntax based
• scale: single projects, large scale (Eclipse, Apache), ultra large scale (SourceForge, GitHub)
Software Metrics
definition, acquisition, and analysis of quantitative measures of certain software properties
• assessment of engineering costs for development, change, maintenance, etc.
• comparative analysis of software systems or analysis of software evolution
• comparative analysis of software engineering methodologies
• language independent (e.g. LOC)
• syntax based (e.g. McCabe)
• static, dynamic (evolution)
Reverse Engineering
analyzing existing code bases to create representations at a higher level of abstraction (models)
• understanding existing software for development, change, maintenance, etc.
• derive AST, UML, or KDM models from software
• syntax (structure, behavior)
• semantics
Problem Statement: Everything is there, but ...
1. Missing abstractions:
■ no general abstractions that cover multiple languages/
repositories are used
■ only proprietary solutions and systems tailored for specific
algorithms/databases, languages, repositories
2. Scalability is an issue:
■ for ultra large scale repositories, only VCS meta-data is used
■ for large scale repositories, only language independent analysis
at file-level granularity is possible
■ only for single software projects is language dependent analysis
at AST-level detail feasible
Proposed Solution: Scalable Model-based
Framework
■ Meta-model and reverse engineering based approach to
analyze code models on different, well-defined levels of
abstraction instead of the code itself.
■ Query and transformation languages as well as model
persistence based on the Map/Reduce big-data paradigm.
■ Target: AST-level analysis of large-scale repositories, e.g.
git.eclipse.org (>300 projects)
SrcRepo: A Framework for Large
Scale Repository Analysis
Model-based Analysis of Large Scale
Software Repositories
VCS → Model → Metrics → Better understanding of software engineering
1) Reverse engineering to create AST-level models of software and its evolution
2) Transformations based on MSR algorithms to derive implicit dependencies
2) Queries to perform measurements based on structural, causal, and implicit dependencies
3) Statistical analysis
1) Reverse Engineering Software in Version
Control Systems (VCS)
[Figure: code in a VCS (files across revisions, linked by causal and structural relations) is reverse engineered into a software model.]
1) Models of Source Repositories
(github.com/markus1978/srcrepo)
[Figure: SrcRepo architecture. SrcRepo combines JGit, MoDisco, and EMF Compare on top of EMF/EMF-Fragments; the input is a git repository with Java sources.]
1) Models of Source Repositories
(github.com/markus1978/srcrepo)
[Figure: example model and SrcRepo metamodel. Model: revisions B1.R1 to B1.R4 and B2.R1 to B2.R2 with files A, B, C, D. Metamodel: Repository, Revision (prev/next), and Diff come from JGit; Compilation Unit, Model, Package, Class, ... come from MoDisco. References such as usageIn, package access, and extends are marked «relation» or «fragmentation».]
1) Models of Source Repositories: Scalability
SrcRepo is based on EMF-Fragments
(https://github.com/markus1978/emf-fragments)
[Figure: storage and processing stack, top to bottom]
■ model transformations on structured data (EMF model): map/reduce (Hadoop)
■ key-value store (EMF resources): HBase
■ DFS: HDFS
■ "Share Nothing" nodes cluster
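To illustrate why fragmenting a model into independently stored parts fits the Map/Reduce paradigm, here is a minimal local sketch. It does not use Hadoop or HBase; `Fragment`, `mapPhase`, `reducePhase`, and `classesPerProject` are hypothetical names chosen for illustration and are not part of EMF-Fragments.

```scala
// Hypothetical fragment: one independently stored part of a model,
// e.g. the classes found in one compilation unit of a project.
final case class Fragment(project: String, classCount: Int)

object MapReduceSketch {
  // Map phase: each fragment is processed in isolation and emits a key-value pair.
  def mapPhase(f: Fragment): (String, Int) = f.project -> f.classCount

  // Reduce phase: all values that share a key are combined (here: summed).
  def reducePhase(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (project, kvs) => project -> kvs.map(_._2).sum }

  // The whole analysis is the composition of both phases.
  def classesPerProject(fragments: Seq[Fragment]): Map[String, Int] =
    reducePhase(fragments.map(mapPhase))
}
```

Because the map phase never looks at more than one fragment, it could in principle run on separate nodes; only the reduce phase needs the grouped intermediate results.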
2) Scala for queries and transformations:
Syntax (internal DSL: from OCL to Scala)
Filip Krikava: Enriching EMF Models with Scala (quick overview), Eclipse Summit, Oct 24 2012
2) Scala for Queries: Syntax
def exists(predicate: (E) => Boolean): Boolean
def forAll(predicate: (E) => Boolean): Boolean
def select(predicate: (E) => Boolean): Collection[E]
def reject(predicate: (E) => Boolean): Collection[E]
def collect[R](expr: (E) => R): Collection[R]
def collectAll[R](expr: (E) => Collection[R]): Collection[R]
def closure(expr: (E) => Collection[E]): Collection[E]
def aggregate[R](expr: (E) => R, start: () => R, aggr: (R, R) => R): R
def sum(expr: (E) => Double): Double
def product(expr: (E) => Double): Double
def max(expr: (E) => Double): Double
def min(expr: (E) => Double): Double
def average(expr: (E) => Double): Double
...
def run(runnable: (E) => Unit): Unit
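A minimal in-memory sketch of how some of these OCL-style operations could be implemented on top of the Scala standard library. The class name `Query` and its `elements` field are hypothetical, and the sketch assumes that `closure` includes the start elements; the actual SrcRepo implementation may differ on both points.

```scala
import scala.annotation.tailrec

// Hypothetical wrapper around an in-memory collection of model elements.
final class Query[E](val elements: Vector[E]) {
  def exists(predicate: E => Boolean): Boolean = elements.exists(predicate)
  def forAll(predicate: E => Boolean): Boolean = elements.forall(predicate)
  def select(predicate: E => Boolean): Query[E] = new Query(elements.filter(predicate))
  def reject(predicate: E => Boolean): Query[E] = new Query(elements.filterNot(predicate))
  def collect[R](expr: E => R): Query[R] = new Query(elements.map(expr))
  def collectAll[R](expr: E => Iterable[R]): Query[R] = new Query(elements.flatMap(expr))

  // Transitive closure: the start elements plus everything reachable via expr
  // (assumption: start elements are part of the result).
  def closure(expr: E => Iterable[E]): Query[E] = {
    @tailrec
    def loop(frontier: Vector[E], seen: Vector[E]): Vector[E] =
      if (frontier.isEmpty) seen
      else {
        // Only elements not seen before are explored further, so cycles terminate.
        val next = frontier.flatMap(expr).distinct.filterNot(e => seen.contains(e))
        loop(next, seen ++ next)
      }
    val start = elements.distinct
    new Query(loop(start, start))
  }

  def aggregate[R](expr: E => R, start: () => R, aggr: (R, R) => R): R =
    elements.map(expr).foldLeft(start())(aggr)

  def sum(expr: E => Double): Double = aggregate[Double](expr, () => 0.0, _ + _)

  // Returns 0.0 for an empty collection instead of dividing by zero.
  def average(expr: E => Double): Double =
    if (elements.isEmpty) 0.0 else sum(expr) / elements.size
}
```

For example, `new Query(Vector(1, 2, 3, 4)).select(_ % 2 == 0)` keeps the even numbers, and `closure` can walk a containment hierarchy such as nested packages.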
2) Scala for Queries: Syntax
■ example SrcRepo query: “average number of methods per
class”
def avgMethodsPerClass(self: Model): Double = {
  // all packages, including nested ones
  val packages = self.getOwnedPackages().
    closure((p) => p.getOwnedPackages())
  // collectAll flattens the per-package class collections; closure adds inner classes
  val classes = packages.collectAll((p) => p.getOwnedClasses()).
    closure((c) => c.getInnerClasses())
  classes.average((c) => c.getOwnedMethods().size())
}
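The same metric can be tried out without EMF or MoDisco. The following sketch uses plain Scala collections on a hypothetical in-memory model; `Method`, `Clazz`, `Pkg`, and `AvgMethods` are illustrative stand-ins, not the MoDisco metamodel classes.

```scala
// Hypothetical stand-ins for the model elements used in the query.
final case class Method(name: String)
final case class Clazz(name: String, methods: List[Method], innerClasses: List[Clazz] = Nil)
final case class Pkg(name: String, classes: List[Clazz], subPackages: List[Pkg] = Nil)

object AvgMethods {
  // A package together with all packages nested below it.
  def allPackages(p: Pkg): List[Pkg] = p :: p.subPackages.flatMap(allPackages)

  // A class together with all of its (transitively) inner classes.
  def allClasses(c: Clazz): List[Clazz] = c :: c.innerClasses.flatMap(allClasses)

  // Average number of methods per class over all packages and inner classes.
  def avgMethodsPerClass(roots: List[Pkg]): Double = {
    val classes = roots.flatMap(allPackages).flatMap(_.classes).flatMap(allClasses)
    if (classes.isEmpty) 0.0
    else classes.map(_.methods.size.toDouble).sum / classes.size
  }
}
```

The two recursive helpers play the role of the two `closure` calls in the query: one walks nested packages, the other walks inner classes.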
2) Scala and internal DSLs: Semantics
■ Three different semantics, one interface
■ immediate collection
■ lazy iterator
■ Map/Reduce database
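A minimal sketch of the "one interface, several semantics" idea, covering only the immediate and the lazy variant (a Map/Reduce backed variant is left out). The names `Elements`, `Immediate`, `Lazy`, and `SemanticsDemo` are hypothetical and only serve the illustration.

```scala
// Hypothetical common interface; names are illustrative, not SrcRepo's API.
trait Elements[E] {
  def select(predicate: E => Boolean): Elements[E]
  def collect[R](expr: E => R): Elements[R]
  def sum(expr: E => Double): Double
}

// Immediate semantics: every step materializes a new collection.
final class Immediate[E](items: Vector[E]) extends Elements[E] {
  def select(predicate: E => Boolean): Elements[E] = new Immediate(items.filter(predicate))
  def collect[R](expr: E => R): Elements[R] = new Immediate(items.map(expr))
  def sum(expr: E => Double): Double = items.map(expr).sum
}

// Lazy semantics: steps are only chained; elements are pulled when sum runs.
final class Lazy[E](iterate: () => Iterator[E]) extends Elements[E] {
  def select(predicate: E => Boolean): Elements[E] = new Lazy(() => iterate().filter(predicate))
  def collect[R](expr: E => R): Elements[R] = new Lazy(() => iterate().map(expr))
  def sum(expr: E => Double): Double = iterate().map(expr).sum
}

object SemanticsDemo {
  // Written once against the interface, usable with either implementation.
  def totalEvenSquares(xs: Elements[Int]): Double =
    xs.select(_ % 2 == 0).collect(x => x * x).sum(_.toDouble)
}
```

The query in `SemanticsDemo` does not know which implementation it runs on, which is what would allow the same query text to be evaluated in memory or pushed to a database.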
Example Analysis
First Example Case Study: Design Structure
Matrices (DSM) and Propagation Costs
Alan MacCormack, John Rusnak, Carliss Baldwin: Exploring the Structure of Complex Software
Designs, Management Science (INFORMS), 2006
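A minimal sketch of the metric, assuming that propagation cost is the density of the visibility matrix, i.e. the transitive closure of the dependency matrix in which every element is also visible to itself. The object and method names are hypothetical, and the closure is computed with Warshall's algorithm rather than by the matrix powers described in the paper.

```scala
object PropagationCost {
  // deps(i)(j) == true means element i directly depends on element j.
  def visibility(deps: Vector[Vector[Boolean]]): Vector[Vector[Boolean]] = {
    val n = deps.length
    // Start from the direct dependencies plus the diagonal (path length 0).
    val v = Array.tabulate(n, n)((i, j) => i == j || deps(i)(j))
    // Warshall's algorithm: allow k as an intermediate element, one k at a time.
    for (k <- 0 until n; i <- 0 until n if v(i)(k); j <- 0 until n if v(k)(j))
      v(i)(j) = true
    v.map(_.toVector).toVector
  }

  // Share of element pairs (i, j) where a change to j can reach i.
  def propagationCost(deps: Vector[Vector[Boolean]]): Double = {
    val n = deps.length
    if (n == 0) 0.0
    else visibility(deps).map(_.count(identity)).sum.toDouble / (n.toLong * n)
  }
}
```

For a chain of three elements where 0 depends on 1 and 1 depends on 2, six of the nine matrix cells are set, so the propagation cost is 6/9.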
Second Example Case Study: Detecting
Cross-Cutting Concerns
Silvia Breu, Thomas Zimmermann, Christian Lindig: Mining Eclipse for Cross-Cutting Concerns,
MSR '06, Shanghai, 2006
■ The same set of methods called from different locations
within the same transaction (commits in a small time
window by the same committer) indicates the introduction
of a cross-cutting concern.
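The heuristic above can be sketched on simplified data. This is not the algorithm from the cited paper; `AddedCall`, `windowSeconds`, and `minLocations` are hypothetical names, and a transaction is assumed here to be a run of changes by one committer in which consecutive timestamps are at most `windowSeconds` apart.

```scala
// Hypothetical, simplified input: one record per method call added by a commit.
final case class AddedCall(committer: String, timestamp: Long, location: String, callee: String)

object CrossCutting {
  // Group calls into transactions: same committer, consecutive calls close in time.
  def transactions(calls: Seq[AddedCall], windowSeconds: Long): Seq[Seq[AddedCall]] =
    calls.groupBy(_.committer).values.toSeq.flatMap { byCommitter =>
      byCommitter.sortBy(_.timestamp).foldLeft(Vector.empty[Vector[AddedCall]]) { (acc, call) =>
        acc.lastOption match {
          case Some(tx) if call.timestamp - tx.last.timestamp <= windowSeconds =>
            acc.init :+ (tx :+ call) // still inside the window: extend the last transaction
          case _ =>
            acc :+ Vector(call)      // otherwise start a new transaction
        }
      }
    }

  // Candidates: a set of callees that was added at minLocations or more locations
  // within one transaction. Result maps the callee set to those locations.
  def candidates(tx: Seq[AddedCall], minLocations: Int): Map[Set[String], Set[String]] =
    tx.groupBy(_.location)
      .map { case (loc, cs) => loc -> cs.map(_.callee).toSet }
      .groupBy { case (_, callees) => callees }
      .map { case (callees, locs) => callees -> locs.keySet }
      .filter { case (_, locs) => locs.size >= minLocations }
}
```

In a real analysis the `AddedCall` records would be derived from the AST-level differences between revisions; here they are given directly.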
Summary
VCS → Model → Metrics → Better understanding of software engineering
1) Reverse engineering to create AST-level models of software and its evolution
2) Transformations based on MSR algorithms to derive implicit dependencies
2) Queries to perform measurements based on structural, causal, and implicit dependencies
3) Statistical analysis

Weitere ähnliche Inhalte

Was ist angesagt?

Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!Josef Hardi
 
Ontologies Ontop Databases
Ontologies Ontop DatabasesOntologies Ontop Databases
Ontologies Ontop DatabasesMartín Rezk
 
Stanford'12 Intro to Ontology Based Data Access for RDBMS through query rewri...
Stanford'12 Intro to Ontology Based Data Access for RDBMS through query rewri...Stanford'12 Intro to Ontology Based Data Access for RDBMS through query rewri...
Stanford'12 Intro to Ontology Based Data Access for RDBMS through query rewri...Mariano Rodriguez-Muro
 
A Taxonomy for Program Metamodels in Program Reverse Engineering
A Taxonomy for Program Metamodels in Program Reverse EngineeringA Taxonomy for Program Metamodels in Program Reverse Engineering
A Taxonomy for Program Metamodels in Program Reverse EngineeringHironori Washizaki
 
ICWE2017 BigDataEurope
ICWE2017 BigDataEuropeICWE2017 BigDataEurope
ICWE2017 BigDataEuropeBigData_Europe
 
A hybrid model to detect malicious executables
A hybrid model to detect malicious executablesA hybrid model to detect malicious executables
A hybrid model to detect malicious executablesUltraUploader
 
Ontop: Answering SPARQL Queries over Relational Databases
Ontop: Answering SPARQL Queries over Relational DatabasesOntop: Answering SPARQL Queries over Relational Databases
Ontop: Answering SPARQL Queries over Relational DatabasesGuohui Xiao
 
An Empirical Study of Refactorings and Technical Debt in Machine Learning Sys...
An Empirical Study of Refactorings and Technical Debt in Machine Learning Sys...An Empirical Study of Refactorings and Technical Debt in Machine Learning Sys...
An Empirical Study of Refactorings and Technical Debt in Machine Learning Sys...Raffi Khatchadourian
 
A Platform for Application Risk Intelligence
A Platform for Application Risk IntelligenceA Platform for Application Risk Intelligence
A Platform for Application Risk IntelligenceCheckmarx
 
Opal Hermes - towards representative benchmarks
Opal  Hermes - towards representative benchmarksOpal  Hermes - towards representative benchmarks
Opal Hermes - towards representative benchmarksMichaelEichberg1
 
Floss Metrics 2009
Floss Metrics 2009Floss Metrics 2009
Floss Metrics 2009Inria
 
20100309 03 - Vulnerability analysis (McCabe)
20100309 03 - Vulnerability analysis (McCabe)20100309 03 - Vulnerability analysis (McCabe)
20100309 03 - Vulnerability analysis (McCabe)LeClubQualiteLogicielle
 
fUML-Driven Performance Analysis through the MOSES Model Library
fUML-Driven Performance Analysisthrough the MOSES Model LibraryfUML-Driven Performance Analysisthrough the MOSES Model Library
fUML-Driven Performance Analysis through the MOSES Model LibraryLuca Berardinelli
 
Analyzing Changes in Software Systems From ChangeDistiller to FMDiff
Analyzing Changes in Software Systems From ChangeDistiller to FMDiffAnalyzing Changes in Software Systems From ChangeDistiller to FMDiff
Analyzing Changes in Software Systems From ChangeDistiller to FMDiffMartin Pinzger
 
Semantic Web and Related Work at W3C
Semantic Web and Related Work at W3CSemantic Web and Related Work at W3C
Semantic Web and Related Work at W3CIvan Herman
 

Was ist angesagt? (20)

Icpc11c.ppt
Icpc11c.pptIcpc11c.ppt
Icpc11c.ppt
 
Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!
 
Ontologies Ontop Databases
Ontologies Ontop DatabasesOntologies Ontop Databases
Ontologies Ontop Databases
 
Stanford'12 Intro to Ontology Based Data Access for RDBMS through query rewri...
Stanford'12 Intro to Ontology Based Data Access for RDBMS through query rewri...Stanford'12 Intro to Ontology Based Data Access for RDBMS through query rewri...
Stanford'12 Intro to Ontology Based Data Access for RDBMS through query rewri...
 
A Taxonomy for Program Metamodels in Program Reverse Engineering
A Taxonomy for Program Metamodels in Program Reverse EngineeringA Taxonomy for Program Metamodels in Program Reverse Engineering
A Taxonomy for Program Metamodels in Program Reverse Engineering
 
ExSchema
ExSchemaExSchema
ExSchema
 
ICWE2017 BigDataEurope
ICWE2017 BigDataEuropeICWE2017 BigDataEurope
ICWE2017 BigDataEurope
 
ontop: A tutorial
ontop: A tutorialontop: A tutorial
ontop: A tutorial
 
A hybrid model to detect malicious executables
A hybrid model to detect malicious executablesA hybrid model to detect malicious executables
A hybrid model to detect malicious executables
 
Ontop: Answering SPARQL Queries over Relational Databases
Ontop: Answering SPARQL Queries over Relational DatabasesOntop: Answering SPARQL Queries over Relational Databases
Ontop: Answering SPARQL Queries over Relational Databases
 
Results of the FLOSSMetrics project
Results of the FLOSSMetrics projectResults of the FLOSSMetrics project
Results of the FLOSSMetrics project
 
An Empirical Study of Refactorings and Technical Debt in Machine Learning Sys...
An Empirical Study of Refactorings and Technical Debt in Machine Learning Sys...An Empirical Study of Refactorings and Technical Debt in Machine Learning Sys...
An Empirical Study of Refactorings and Technical Debt in Machine Learning Sys...
 
A Platform for Application Risk Intelligence
A Platform for Application Risk IntelligenceA Platform for Application Risk Intelligence
A Platform for Application Risk Intelligence
 
Opal Hermes - towards representative benchmarks
Opal  Hermes - towards representative benchmarksOpal  Hermes - towards representative benchmarks
Opal Hermes - towards representative benchmarks
 
Icsme16.ppt
Icsme16.pptIcsme16.ppt
Icsme16.ppt
 
Floss Metrics 2009
Floss Metrics 2009Floss Metrics 2009
Floss Metrics 2009
 
20100309 03 - Vulnerability analysis (McCabe)
20100309 03 - Vulnerability analysis (McCabe)20100309 03 - Vulnerability analysis (McCabe)
20100309 03 - Vulnerability analysis (McCabe)
 
fUML-Driven Performance Analysis through the MOSES Model Library
fUML-Driven Performance Analysisthrough the MOSES Model LibraryfUML-Driven Performance Analysisthrough the MOSES Model Library
fUML-Driven Performance Analysis through the MOSES Model Library
 
Analyzing Changes in Software Systems From ChangeDistiller to FMDiff
Analyzing Changes in Software Systems From ChangeDistiller to FMDiffAnalyzing Changes in Software Systems From ChangeDistiller to FMDiff
Analyzing Changes in Software Systems From ChangeDistiller to FMDiff
 
Semantic Web and Related Work at W3C
Semantic Web and Related Work at W3CSemantic Web and Related Work at W3C
Semantic Web and Related Work at W3C
 

Ähnlich wie Model-based Analysis of Large Scale Software Repositories

Scalable constrained spectral clustering
Scalable constrained spectral clusteringScalable constrained spectral clustering
Scalable constrained spectral clusteringNishanth Harapanahalli
 
Open source evolution analysis
Open source evolution analysisOpen source evolution analysis
Open source evolution analysisIzzat Alsmadi
 
Software Analytics: Data Analytics for Software Engineering and Security
Software Analytics: Data Analytics for Software Engineering and SecuritySoftware Analytics: Data Analytics for Software Engineering and Security
Software Analytics: Data Analytics for Software Engineering and SecurityTao Xie
 
On Modeling and Testing When Unpredictability Becomes the Pattern (April 2nd,...
On Modeling and Testing When Unpredictability Becomes the Pattern (April 2nd,...On Modeling and Testing When Unpredictability Becomes the Pattern (April 2nd,...
On Modeling and Testing When Unpredictability Becomes the Pattern (April 2nd,...Benoit Combemale
 
A novel approach based on topic
A novel approach based on topicA novel approach based on topic
A novel approach based on topiccsandit
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLionel Briand
 
Predicting reliability of software systems under development
Predicting reliability of software systems under developmentPredicting reliability of software systems under development
Predicting reliability of software systems under developmentRAKESH RANA
 
Put Your Hands in the Mud: What Technique, Why, and How
Put Your Hands in the Mud: What Technique, Why, and HowPut Your Hands in the Mud: What Technique, Why, and How
Put Your Hands in the Mud: What Technique, Why, and HowMassimiliano Di Penta
 
AudrisMockus_MSR22.pdf
AudrisMockus_MSR22.pdfAudrisMockus_MSR22.pdf
AudrisMockus_MSR22.pdfTapajitDey1
 
Towards Reusable Research Software
Towards Reusable Research SoftwareTowards Reusable Research Software
Towards Reusable Research Softwaredgarijo
 
Automating the Generation of Benchmark Suites
Automating the Generation of Benchmark SuitesAutomating the Generation of Benchmark Suites
Automating the Generation of Benchmark SuitesBen Hermann
 
Software Analytics: Towards Software Mining that Matters (2014)
Software Analytics:Towards Software Mining that Matters (2014)Software Analytics:Towards Software Mining that Matters (2014)
Software Analytics: Towards Software Mining that Matters (2014)Tao Xie
 
Preventive Software Maintenance: The Past, the Present, the Future
Preventive Software Maintenance: The Past, the Present, the FuturePreventive Software Maintenance: The Past, the Present, the Future
Preventive Software Maintenance: The Past, the Present, the FutureNikolaos Tsantalis
 
Runtime Behavior of JavaScript Programs
Runtime Behavior of JavaScript ProgramsRuntime Behavior of JavaScript Programs
Runtime Behavior of JavaScript ProgramsIRJET Journal
 
TMPA-2017: Stemming Architectural Decay in Software Systems
TMPA-2017:  Stemming Architectural Decay in Software SystemsTMPA-2017:  Stemming Architectural Decay in Software Systems
TMPA-2017: Stemming Architectural Decay in Software SystemsIosif Itkin
 
A Survey on Design Pattern Detection Approaches
A Survey on Design Pattern Detection ApproachesA Survey on Design Pattern Detection Approaches
A Survey on Design Pattern Detection ApproachesCSCJournals
 

Ähnlich wie Model-based Analysis of Large Scale Software Repositories (20)

SE1.ppt
SE1.pptSE1.ppt
SE1.ppt
 
Scalable constrained spectral clustering
Scalable constrained spectral clusteringScalable constrained spectral clustering
Scalable constrained spectral clustering
 
Open source evolution analysis
Open source evolution analysisOpen source evolution analysis
Open source evolution analysis
 
Software Analytics: Data Analytics for Software Engineering and Security
Software Analytics: Data Analytics for Software Engineering and SecuritySoftware Analytics: Data Analytics for Software Engineering and Security
Software Analytics: Data Analytics for Software Engineering and Security
 
On Modeling and Testing When Unpredictability Becomes the Pattern (April 2nd,...
On Modeling and Testing When Unpredictability Becomes the Pattern (April 2nd,...On Modeling and Testing When Unpredictability Becomes the Pattern (April 2nd,...
On Modeling and Testing When Unpredictability Becomes the Pattern (April 2nd,...
 
A novel approach based on topic
A novel approach based on topicA novel approach based on topic
A novel approach based on topic
 
poster_3.0
poster_3.0poster_3.0
poster_3.0
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
Predicting reliability of software systems under development
Predicting reliability of software systems under developmentPredicting reliability of software systems under development
Predicting reliability of software systems under development
 
Put Your Hands in the Mud: What Technique, Why, and How
Put Your Hands in the Mud: What Technique, Why, and HowPut Your Hands in the Mud: What Technique, Why, and How
Put Your Hands in the Mud: What Technique, Why, and How
 
Saner16b.ppt
Saner16b.pptSaner16b.ppt
Saner16b.ppt
 
Saner16b.ppt
Saner16b.pptSaner16b.ppt
Saner16b.ppt
 
AudrisMockus_MSR22.pdf
AudrisMockus_MSR22.pdfAudrisMockus_MSR22.pdf
AudrisMockus_MSR22.pdf
 
Towards Reusable Research Software
Towards Reusable Research SoftwareTowards Reusable Research Software
Towards Reusable Research Software
 
Automating the Generation of Benchmark Suites
Automating the Generation of Benchmark SuitesAutomating the Generation of Benchmark Suites
Automating the Generation of Benchmark Suites
 
Software Analytics: Towards Software Mining that Matters (2014)
Software Analytics:Towards Software Mining that Matters (2014)Software Analytics:Towards Software Mining that Matters (2014)
Software Analytics: Towards Software Mining that Matters (2014)
 
Preventive Software Maintenance: The Past, the Present, the Future
Preventive Software Maintenance: The Past, the Present, the FuturePreventive Software Maintenance: The Past, the Present, the Future
Preventive Software Maintenance: The Past, the Present, the Future
 
Runtime Behavior of JavaScript Programs
Runtime Behavior of JavaScript ProgramsRuntime Behavior of JavaScript Programs
Runtime Behavior of JavaScript Programs
 
TMPA-2017: Stemming Architectural Decay in Software Systems
TMPA-2017:  Stemming Architectural Decay in Software SystemsTMPA-2017:  Stemming Architectural Decay in Software Systems
TMPA-2017: Stemming Architectural Decay in Software Systems
 
A Survey on Design Pattern Detection Approaches
A Survey on Design Pattern Detection ApproachesA Survey on Design Pattern Detection Approaches
A Survey on Design Pattern Detection Approaches
 

Kürzlich hochgeladen

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Kürzlich hochgeladen (20)

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Designing IA for AI - Information Architecture Conference 2024

Model-based Analysis of Large Scale Software Repositories

  • 1. Dr. Markus Scheidgen: Model-based Analysis of Large Scale Software Repositories
      ■ problem
      ■ creating models of software repositories
      ■ the means for analyzing such models
      ■ example analysis
  • 3. Is Software Engineering a Science?
      ■ Def.: Science (from Latin scientia) is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe.
      ■ Testable? Example theses:
          ★ DSLs allow domain experts to develop software effectively and more efficiently than with GPLs.
          ★ Static type systems lead to safer programming and fewer bugs.
          ★ Functional programming leads to less performant programs.
          ★ Scrum allows programs to be developed faster.
          ★ My framework allows developing ... more, faster ... with less, fewer ...
      ■ Methods for quantitative measurement of software properties (metrics) are mostly used to assess the state of software projects, and rarely for empirical studies of software engineering itself.
  • 4. Reasons
      ■ inaccessibility
          • new methods have to be used first to produce data
          • industry cooperation is necessary
          • open-source repositories are a possibility
      ■ data quality
          • not easy to distinguish between hand-written code, generated code, and test code
          • there are maintained projects, projects under development, and aborted projects
      ■ heterogeneity
          • different project structures
          • different paradigms
          • different languages
          • different APIs
      ■ amounts of data
          • SourceForge hosts > 350,000 projects
          • the current snapshot of the Linux kernel contains 10^8 AST nodes
          • EMF's 50 MB Git repository takes 20 GB as binary-encoded AST data
  • 5. Relevant Fields with Partial Solutions
      ■ Mining Software Repositories (MSR): analysis of the rich data contained in software-engineering-related repositories such as version control systems, mailing lists, and bug-tracking systems
          • guiding software development
          • defect detection, prediction, resolution
          • gaining actionable knowledge about software projects and software engineering methodologies
          • static language independent
          • syntax based
          • scale: single projects, large scale (Eclipse, Apache), ultra large scale (SourceForge, GitHub)
      ■ Software Metrics: definition, acquisition, and analysis of quantitative measures of certain software properties
          • assessment of engineering costs for development, change, maintenance, etc.
          • comparative analysis of software systems or analysis of software evolution
          • comparative analysis of software engineering methodologies
          • language independent (e.g. LOC)
          • syntax based (e.g. McCabe)
          • static, dynamic (evolution)
      ■ Reverse Engineering: analysis of existing code bases to create representations at a higher level of abstraction (models)
          • understanding existing software for development, change, maintenance, etc.
          • derive AST, UML, or KDM models from software
          • syntax (structure, behavior)
          • semantics
  • 6. Problem Statement: Everything is there, but ...
      1. Missing abstractions:
          ■ no general abstractions that cover multiple languages/repositories are used
          ■ only proprietary solutions and systems tailored to specific algorithms/databases, languages, repositories
      2. Scalability is an issue:
          ■ for ultra large scale repositories, only VCS meta-data is used
          ■ for large scale repositories, only language-independent analysis at file-level granularity is possible
          ■ only for single software projects is language-dependent analysis at AST-level detail feasible
  • 7. Proposed Solution: Scalable Model-based Framework
      ■ Meta-model and reverse engineering based approach to analyze code models on different and well-defined levels of abstraction instead of the code itself.
      ■ Query and transformation languages as well as model persistence based on the Map/Reduce big data paradigm.
      ■ Target: AST-level analysis of large-scale repositories, e.g. git.eclipse.org (>300 projects)
  • 8. SrcRepo: A Framework for Large Scale Repository Analysis
  • 9-14. Model-based Analysis of Large Scale Software Repositories
      VCS → Model → Metrics
      1) Reverse engineering to create AST-level models of software and its evolution
      2) Transformations based on MSR algorithms to derive implicit dependencies
      2) Queries to perform measurements based on structural, causal, and implicit dependencies
      3) Statistical analysis
      → Better understanding of software engineering
  • 15. 1) Reverse Engineering Software in Version Control Systems (VCS)
      [Figure: "Code in a VCS" (code units arranged by files and revisions) mapped to a "Software Model" with causal relations and structural relations]
  • 16. 1) Models of Source Repositories (github.com/markus1978/srcrepo)
      [Figure: SrcRepo architecture. Components: SrcRepo, EMF Compare, JGit, MoDisco, EMF/EMF-Fragments; input: Git repository with Java sources]
  • 17. 1) Models of Source Repositories (github.com/markus1978/srcrepo)
      [Figure: example revision graph (revisions B1.R1 to B1.R4 and B2.R1 to B2.R2) next to the metamodel. JGit part: Repository, Revision (prev/next), Diff. MoDisco part: Compilation Unit Model, Package, Class, ... (usageIn, Package Access, package, extends). References are marked «relation», «fragmentation», or both.]
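To make the metamodel on this slide more concrete, here is a minimal sketch in plain Scala. It is not the SrcRepo metamodel, which is defined with EMF on top of JGit and MoDisco; the case classes, field names, and the branch structure of the example revisions are assumptions made for illustration only.

```scala
// Hypothetical, simplified stand-ins for the metamodel elements named on the slide.
// These case classes are NOT the EMF/JGit/MoDisco API.
object RepoModelSketch {
  final case class ClassDecl(name: String, superClass: Option[String] = None)
  final case class Pkg(name: String, classes: List[ClassDecl] = Nil, subPackages: List[Pkg] = Nil)
  final case class CompilationUnitModel(path: String, packages: List[Pkg])
  // A diff points to the model of the changed compilation unit, if one was parsed.
  final case class Diff(path: String, unit: Option[CompilationUnitModel])
  final case class Revision(id: String, parents: List[String], diffs: List[Diff])

  final case class Repository(revisions: List[Revision]) {
    private val byId: Map[String, Revision] = revisions.map(r => r.id -> r).toMap
    // "prev" relation: the parent revisions of r.
    def prev(r: Revision): List[Revision] = r.parents.flatMap(byId.get)
    // "next" relation: all revisions that name r as a parent.
    def next(r: Revision): List[Revision] = revisions.filter(_.parents.contains(r.id))
  }

  // Example data; the branching of B2.R1 from B1.R1 is an assumption.
  val unitA = CompilationUnitModel("A.java", List(Pkg("p", List(ClassDecl("A")))))
  val r1 = Revision("B1.R1", Nil, List(Diff("A.java", Some(unitA))))
  val r2 = Revision("B1.R2", List("B1.R1"), List(Diff("B.java", None)))
  val r3 = Revision("B2.R1", List("B1.R1"), Nil)
  val repo = Repository(List(r1, r2, r3))
}
```

In this sketch the causal relations (prev/next between revisions) live in the repository part, while the structural relations (packages, classes) live in the per-revision compilation unit models.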
  • 18. 1) Models of Source Repositories: Scalability
      SrcRepo is based on EMF-Fragments (https://github.com/markus1978/emf-fragments).
      [Figure: technology stack. Labels: map/reduce (Hadoop); cluster of “share nothing” nodes; DFS (HDFS); key-value store (EMF resources, HBase); structured data (EMF model); model transformations]
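The reason a metric such as "average methods per class" fits the map/reduce style can be sketched in a few lines of plain Scala. This is an in-memory illustration only; `Fragment`, `Partial`, and the function names are hypothetical and do not use the Hadoop, HBase, or EMF-Fragments APIs.

```scala
object MapReduceSketch {
  // Hypothetical fragment: the method counts of the classes stored in one resource.
  final case class Fragment(key: String, methodsPerClass: List[Int])

  // Partial result that can be merged in any order (associative and commutative).
  final case class Partial(methods: Long, classes: Long) {
    def merge(o: Partial): Partial = Partial(methods + o.methods, classes + o.classes)
    def average: Double = if (classes == 0) 0.0 else methods.toDouble / classes
  }

  // Map step: each fragment is processed independently of all others.
  def mapFragment(f: Fragment): Partial =
    Partial(f.methodsPerClass.map(_.toLong).sum, f.methodsPerClass.size.toLong)

  // Reduce step: combine the partial results into the final metric.
  def avgMethodsPerClass(fragments: Seq[Fragment]): Double =
    fragments.map(mapFragment).foldLeft(Partial(0, 0))(_ merge _).average
}
```

Because the average itself is not mergeable, the sketch carries the pair (sum, count) through the reduce step and divides only at the end.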
  • 19. 2) Scala for queries and transformations: Syntax (internal DSL: from OCL to Scala)
      Filip Krikava: Enriching EMF Models with Scala (quick overview), Eclipse Summit, Oct 24 2012
  • 20. 2) Scala for Queries: Syntax
          def exists(predicate: (E) => Boolean): Boolean
          def forAll(predicate: (E) => Boolean): Boolean
          def select(predicate: (E) => Boolean): Collection[E]
          def reject(predicate: (E) => Boolean): Collection[E]
          def collect[R](expr: (E) => R): Collection[R]
          def collectAll[R](expr: (E) => Collection[R]): Collection[R]
          def closure(expr: (E) => Collection[E]): Collection[E]
          def aggregate[R](expr: (E) => R, start: () => R, aggr: (R, R) => R): R
          def sum(expr: (E) => Double): Double
          def product(expr: (E) => Double): Double
          def max(expr: (E) => Double): Double
          def min(expr: (E) => Double): Double
          def average(expr: (E) => Double): Double
          ...
          def run(runnable: (E) => Unit): Unit
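One way such OCL-style operators could be implemented over an in-memory collection is sketched below. The slides do not show SrcRepo's implementation, so the wrapper `Coll` is hypothetical, and the choice that `closure` includes the start elements follows OCL's `closure` semantics as an assumption.

```scala
object OclStyleOps {
  // Hypothetical wrapper around an immutable Vector; not SrcRepo's actual type.
  final case class Coll[E](items: Vector[E]) {
    def exists(p: E => Boolean): Boolean = items.exists(p)
    def forAll(p: E => Boolean): Boolean = items.forall(p)
    def select(p: E => Boolean): Coll[E] = Coll(items.filter(p))
    def reject(p: E => Boolean): Coll[E] = Coll(items.filterNot(p))
    def collect[R](f: E => R): Coll[R] = Coll(items.map(f))
    // Like collect, but flattens the nested collections.
    def collectAll[R](f: E => Coll[R]): Coll[R] = Coll(items.flatMap(e => f(e).items))

    // Transitive closure: the start elements plus everything reachable through f.
    // Already visited elements are skipped, so cycles terminate.
    def closure(f: E => Coll[E]): Coll[E] = {
      @scala.annotation.tailrec
      def loop(todo: List[E], seen: Vector[E]): Vector[E] = todo match {
        case Nil => seen
        case e :: rest =>
          if (seen.contains(e)) loop(rest, seen)
          else loop(f(e).items.toList ::: rest, seen :+ e)
      }
      Coll(loop(items.toList, Vector.empty))
    }

    def sum(f: E => Double): Double = items.iterator.map(f).sum
    // Assumption: the average of an empty collection is 0.0.
    def average(f: E => Double): Double = if (items.isEmpty) 0.0 else sum(f) / items.size
  }
}
```

The `seen.contains` check is linear, which is acceptable for a sketch; a real implementation would track visited elements in a hash set.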
  • 21. 2) Scala for Queries: Syntax
      ■ example SrcRepo query: “average number of methods per class”
          def avgMethodsPerClass(self: Model) = {
            // all packages, including nested ones
            val packages = self.getOwnedPackages().
              closure((p) => p.getOwnedPackages())
            // all classes of those packages, including inner classes;
            // collectAll flattens the per-package class collections
            val classes = packages.collectAll((p) => p.getOwnedClasses()).
              closure((c) => c.getInnerClasses())
            classes.average((c) => c.getOwnedMethods().size())
          }
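The query above needs the MoDisco model and the internal DSL to run. As a self-contained illustration of the same logic, the following sketch uses a hypothetical toy model (`Pkg`, `Cls`) and plain Scala lists; the names are assumptions and not part of MoDisco or SrcRepo.

```scala
object AvgMethodsSketch {
  // Hypothetical toy model; the real query runs against MoDisco's Java model.
  final case class Cls(name: String, methods: List[String], innerClasses: List[Cls] = Nil)
  final case class Pkg(name: String, ownedClasses: List[Cls] = Nil, ownedPackages: List[Pkg] = Nil)

  // Start elements plus everything reachable through `step` (depth-first).
  // Assumes a containment tree, i.e. no cycles.
  def closure[A](start: List[A])(step: A => List[A]): List[A] =
    start.flatMap(a => a :: closure(step(a))(step))

  def avgMethodsPerClass(topLevel: List[Pkg]): Double = {
    val packages = closure(topLevel)(_.ownedPackages)
    val classes = closure(packages.flatMap(_.ownedClasses))(_.innerClasses)
    if (classes.isEmpty) 0.0
    else classes.map(_.methods.size.toDouble).sum / classes.size
  }

  // Three classes with 2, 1, and 3 methods.
  val example = List(
    Pkg("a",
      ownedClasses = List(Cls("A", List("m1", "m2"), List(Cls("A.Inner", List("m3"))))),
      ownedPackages = List(Pkg("a.b", List(Cls("B", List("m4", "m5", "m6")))))))
}
```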
  • 22. 2) Scala and internal DSLs: Semantics
      ■ Three different semantics, one interface:
          ■ immediate collection
          ■ lazy iterator
          ■ Map/Reduce database
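The idea of one interface with several execution semantics can be sketched as a trait with two implementations. The trait and class names below are hypothetical, only three operators are included, and the Map/Reduce variant is left out because it would require a cluster back end.

```scala
object QuerySemanticsSketch {
  // Hypothetical interface; names are illustrative, not SrcRepo's API.
  trait Query[E] {
    def select(p: E => Boolean): Query[E]
    def collect[R](f: E => R): Query[R]
    def sum(f: E => Double): Double
  }

  // Immediate semantics: every operator materializes a new collection.
  final class Immediate[E](items: Vector[E]) extends Query[E] {
    def select(p: E => Boolean): Query[E] = new Immediate(items.filter(p))
    def collect[R](f: E => R): Query[R] = new Immediate(items.map(f))
    def sum(f: E => Double): Double = items.iterator.map(f).sum
  }

  // Lazy semantics: operators only compose iterators; work happens in sum.
  final class Lazy[E](items: () => Iterator[E]) extends Query[E] {
    def select(p: E => Boolean): Query[E] = new Lazy(() => items().filter(p))
    def collect[R](f: E => R): Query[R] = new Lazy(() => items().map(f))
    def sum(f: E => Double): Double = items().map(f).sum
  }

  // The same query text runs unchanged against either implementation.
  def totalOfEvens(q: Query[Int]): Double = q.select(_ % 2 == 0).sum(_.toDouble)
}
```

The lazy variant avoids holding intermediate collections in memory, which matters when a model is too large to load at once.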
  • 24-25. First Example Case Study: Design Structure Matrices (DSM) and Propagation Costs
      Alan MacCormack, John Rusnak, Carliss Baldwin: Exploring the Structure of Complex Software Designs, Management Science (INFORMS), 2006
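As a rough illustration of the metric behind this case study: propagation cost is commonly described as the density of the visibility matrix, that is, the transitive closure of the direct dependencies in the DSM with the diagonal included. The sketch below follows that description; the matrix encoding and function names are assumptions, and it is not the code used in the study or in SrcRepo.

```scala
object PropagationCostSketch {
  // dsm(i)(j) == true means element i depends directly on element j.
  def visibility(dsm: Vector[Vector[Boolean]]): Vector[Vector[Boolean]] = {
    val n = dsm.length
    // Start from the direct dependencies plus the diagonal (path length 0).
    val v = Array.tabulate(n, n)((i, j) => i == j || dsm(i)(j))
    // Warshall's algorithm: k must be the outermost loop.
    for (k <- 0 until n; i <- 0 until n; j <- 0 until n)
      if (v(i)(k) && v(k)(j)) v(i)(j) = true
    v.map(_.toVector).toVector
  }

  // Fraction of (i, j) pairs where a change to j can reach i, directly or indirectly.
  def propagationCost(dsm: Vector[Vector[Boolean]]): Double = {
    val n = dsm.length
    if (n == 0) 0.0
    else visibility(dsm).map(_.count(identity)).sum.toDouble / (n * n)
  }

  // Example: a chain A -> B -> C.
  val chain = Vector(
    Vector(false, true, false),
    Vector(false, false, true),
    Vector(false, false, false))
}
```

For the chain example, A sees A, B, and C, B sees B and C, and C sees only itself, so 6 of the 9 cells are set.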
  • 26-29. Second Example Case Study: Detecting Cross-Cutting Concerns
      Silvia Breu, Thomas Zimmermann, Christian Lindig: Mining Eclipse for Cross-Cutting Concerns, MSR '06, Shanghai, 2006
      ■ The same set of methods being called from different locations within the same transaction (commits in a small time window by the same committer) indicates the introduction of a cross-cutting concern.
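The heuristic on this slide can be sketched as two steps: group commits into transactions, then look for identical sets of callees inserted at several locations. This is a deliberately simplified reading of the idea, not the algorithm from the cited paper; the data types, the exact-set matching, and the window handling are all assumptions.

```scala
object CrossCuttingSketch {
  // Hypothetical input: a call to `callee` was inserted into method `location`.
  final case class AddedCall(location: String, callee: String)
  final case class Commit(committer: String, time: Long, added: List[AddedCall])

  // Transactions: commits by the same committer whose consecutive gaps stay within `window`.
  def transactions(commits: List[Commit], window: Long): List[List[Commit]] =
    commits.groupBy(_.committer).values.toList.flatMap { cs =>
      cs.sortBy(_.time).foldLeft(List.empty[List[Commit]]) {
        // cur.head is the most recent commit of the current transaction
        case (cur :: done, c) if c.time - cur.head.time <= window => (c :: cur) :: done
        case (acc, c) => List(c) :: acc
      }
    }

  // Candidate concerns: callee sets that were inserted at >= minLocations locations.
  def candidates(tx: List[Commit], minLocations: Int = 2): Map[Set[String], Set[String]] = {
    val calleesByLocation: Map[String, Set[String]] =
      tx.flatMap(_.added).groupBy(_.location).map { case (loc, cs) => loc -> cs.map(_.callee).toSet }
    calleesByLocation
      .groupBy(_._2)
      .map { case (callees, locs) => callees -> locs.keySet }
      .filter(_._2.size >= minLocations)
  }
}
```

For example, if one committer adds calls to `lock` and `unlock` in both `A.foo` and `B.bar` within the window, the pair `{lock, unlock}` is reported with those two locations as a candidate concern.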
  • 30. Summary
      VCS → Model → Metrics
      1) Reverse engineering to create AST-level models of software and its evolution
      2) Transformations based on MSR algorithms to derive implicit dependencies
      2) Queries to perform measurements based on structural, causal, and implicit dependencies
      Statistical analysis
      → Better understanding of software engineering