Technische Universität München
Scalable Semantic Version Control
for Linked Data Management
Claudius Hauptmann, Michele Brocco and Wolfgang Wörndl
Technische Universität München
2nd Workshop on Linked Data Quality
at ESWC 2015 (June 1st, 2015 - Portorož, Slovenia)
Version Control for Linked Data - Goal
• Scalable (reduce disk space, CPU and memory
consumption, network traffic and disk I/O)
• Semantic (data about versioning accessible via SPARQL
queries and OWL reasoners)
Query Types
• Cross-version queries ("Which triples were modified last
month that are related to Portorož?")
• Targeted queries ("What did we know about Portorož in
version X?")
Update Types
• Creation of branches
• Modification of triples (working on a branch)
• Review and correction of modifications
• Commit of modifications (creation of a version)
• Merging of branches
• Deletion of branches
Storage Strategies for Versioned Triples
• Version-based
• Delta-based
• Hybrid
• Hypergraph-based
• Partial Order Index
Delta-Based Storage Schema
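As a rough illustration of the delta-based idea (the schema figure from this slide is not reproduced here), each commit could store only the triples it adds and deletes relative to its parent. The `Commit` class, its field names and the sample triples are assumptions made for this sketch, not the schema used by the authors.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

Triple = Tuple[str, str, str]


@dataclass
class Commit:
    """One version, stored as a delta against its parent (hypothetical model)."""
    commit_id: str
    parent_id: Optional[str]
    added: Set[Triple] = field(default_factory=set)
    deleted: Set[Triple] = field(default_factory=set)


def materialize(commits: Dict[str, Commit], commit_id: str) -> Set[Triple]:
    """Rebuild the full triple set of a version by replaying deltas from the root."""
    chain = []
    current: Optional[str] = commit_id
    while current is not None:
        chain.append(commits[current])
        current = commits[current].parent_id
    triples: Set[Triple] = set()
    for commit in reversed(chain):  # oldest commit first
        triples -= commit.deleted
        triples |= commit.added
    return triples


# Illustrative data only.
TOWN = ("ex:Portorož", "rdf:type", "ex:Town")
CITY = ("ex:Portorož", "rdf:type", "ex:City")
COUNTRY = ("ex:Portorož", "ex:country", "ex:Slovenia")

commits = {
    "c1": Commit("c1", None, added={TOWN, COUNTRY}),
    "c2": Commit("c2", "c1", added={CITY}, deleted={TOWN}),
}
```

Full materialization like this touches every delta on the path to the root, which is the cost that partial on-demand reconstruction tries to avoid for targeted queries.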
Partial On-Demand Reconstruction of Historic Versions
Query Engine Optimization
• Arbitrary length path operator is inefficient (many lookup
operations)
• Replace slow operators with new operators that are
optimized for partial on-demand version reconstruction
• Use in-memory indices for commits
• Cache indices
Step 1: Loading commits into in-memory index
• Caching for queries on same graph
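Step 1 could look roughly like the following: the commit graph of a named graph is read once from the store and then served from memory for later queries on the same graph. `COMMIT_ROWS` is a hypothetical stand-in for the triple store, and all function and variable names are illustrative.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical store content: (graph, commit, parent) rows.
COMMIT_ROWS: List[Tuple[str, str, Optional[str]]] = [
    ("g1", "c1", None),
    ("g1", "c2", "c1"),
    ("g1", "c3", "c2"),
    ("g2", "x1", None),
]

store_reads = 0  # counts how often the store is actually scanned
_commit_index_cache: Dict[str, Dict[str, List[str]]] = {}


def load_commit_index(graph: str) -> Dict[str, List[str]]:
    """Return a commit -> parents mapping for `graph`, cached per graph."""
    global store_reads
    if graph not in _commit_index_cache:
        store_reads += 1
        index: Dict[str, List[str]] = {}
        for row_graph, commit, parent in COMMIT_ROWS:
            if row_graph != graph:
                continue
            parents = index.setdefault(commit, [])
            if parent is not None:
                parents.append(parent)
        _commit_index_cache[graph] = index
    return _commit_index_cache[graph]


first = load_commit_index("g1")
second = load_commit_index("g1")  # served from the cache, no second scan
```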
Step 2: Planning Commit Graph Traversal
• Caching for queries on same graph and same version
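One way to picture Step 2 is a function that orders the ancestors of the queried commit so that every commit comes before its parents, with the result cached per (graph, version) pair. The sketch below uses a depth-first topological sort and `functools.lru_cache`; the `PARENTS` data and the function name are assumptions, and the actual planner may work differently.

```python
from functools import lru_cache
from typing import Dict, Tuple

# Hypothetical in-memory commit index: graph -> commit -> parents.
PARENTS: Dict[str, Dict[str, Tuple[str, ...]]] = {
    "g1": {"c1": (), "c2": ("c1",), "c3": ("c1",), "c4": ("c2", "c3")},
}


@lru_cache(maxsize=None)  # cache keyed by (graph, version)
def plan_traversal(graph: str, version: str) -> Tuple[str, ...]:
    """Order the ancestors of `version` so each commit precedes its parents."""
    parents = PARENTS[graph]
    finished = []
    visited = set()
    stack = [(version, False)]
    while stack:
        commit, expanded = stack.pop()
        if expanded:
            finished.append(commit)  # all ancestors are already finished
            continue
        if commit in visited:
            continue
        visited.add(commit)
        stack.append((commit, True))
        for parent in parents[commit]:
            stack.append((parent, False))
    finished.reverse()  # newest commit first
    return tuple(finished)


plan = plan_traversal("g1", "c4")
plan_again = plan_traversal("g1", "c4")  # answered from the cache
```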
Step 3: Loading changes for chunk of triples
• Load relationships between triples and commits:
  • Commit-triple index
  • Triple-commit index
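The two indices of Step 3 can be pictured as a pair of dictionaries built only for the triples in the current chunk. In this sketch `CHANGES` stands in for the stored deltas, and the "add"/"delete" labels are assumed names for the two kinds of change.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]

T1: Triple = ("ex:Portorož", "rdf:type", "ex:Town")
T2: Triple = ("ex:Portorož", "ex:country", "ex:Slovenia")
T3: Triple = ("ex:Piran", "rdf:type", "ex:Town")

# Hypothetical stored deltas: (commit, triple, operation).
CHANGES: List[Tuple[str, Triple, str]] = [
    ("c1", T1, "add"),
    ("c2", T2, "add"),
    ("c2", T3, "add"),
    ("c3", T1, "delete"),
]


def load_changes(chunk: Iterable[Triple]):
    """Build both indices, restricted to the triples in `chunk`."""
    wanted = set(chunk)
    commit_triple: Dict[str, Dict[Triple, str]] = defaultdict(dict)
    triple_commit: Dict[Triple, Set[str]] = defaultdict(set)
    for commit, triple, operation in CHANGES:
        if triple in wanted:
            commit_triple[commit][triple] = operation
            triple_commit[triple].add(commit)
    return dict(commit_triple), dict(triple_commit)


commit_triple, triple_commit = load_changes([T1, T2])
```

Because only the chunk's triples are loaded, the size of both indices depends on the number of candidate triples rather than on the size of the whole history.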
Step 4: Traverse Commit Graph + Test Triples
• Start traversal at commit specified in query
• Get changes for commit from index
• Check if add or delete
• Save result for triple
• Remove triple from indices
• If the indices are not empty, go to the next commit and repeat
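The loop described in Step 4 might be sketched as follows, assuming a linear history and a traversal plan that lists commits from newest to oldest. The first change found for a triple decides whether it exists in the queried version, and the triple is then dropped from the pending set so that the walk can stop early. Function, parameter and label names are illustrative.

```python
from typing import Dict, Iterable, Sequence, Set, Tuple

Triple = Tuple[str, str, str]


def resolve_triples(
    plan: Sequence[str],
    chunk: Iterable[Triple],
    commit_triple: Dict[str, Dict[Triple, str]],
    triple_commit: Dict[Triple, Set[str]],
) -> Dict[Triple, bool]:
    """Decide for each triple in `chunk` whether it exists in the queried version."""
    chunk = list(chunk)
    present = {triple: False for triple in chunk}  # never added means absent
    pending = {triple for triple in chunk if triple in triple_commit}
    for commit in plan:  # starts at the commit named in the query
        if not pending:
            break  # every triple is decided, stop the traversal early
        for triple, operation in commit_triple.get(commit, {}).items():
            if triple in pending:
                present[triple] = operation == "add"
                pending.discard(triple)  # "remove triple from indices"
    return present


T1 = ("ex:Portorož", "rdf:type", "ex:Town")
T2 = ("ex:Portorož", "ex:country", "ex:Slovenia")
T3 = ("ex:Piran", "rdf:type", "ex:Town")

commit_triple = {"c1": {T1: "add"}, "c2": {T2: "add"}, "c3": {T1: "delete"}}
triple_commit = {T1: {"c1", "c3"}, T2: {"c2"}}

at_c3 = resolve_triples(("c3", "c2", "c1"), [T1, T2, T3], commit_triple, triple_commit)
at_c2 = resolve_triples(("c2", "c1"), [T1, T2, T3], commit_triple, triple_commit)
```

Histories with merges would need a rule for which branch's change takes precedence; this sketch does not address that.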
Evaluation - Dataset
• DBpedia class assertions (version 2014)
• 28,031,852 triples
• 3 datasets with 280,319, 28,032 and 2,804 commits (triples
equally distributed over commits, 1 branch, no deletes)
• Baseline: query with 3 arbitrary length path operators
• Repeated 100x (baseline 10x), no caching
• Test queries:
Evaluation - Response Time Query 1 (7 results)
#commits        | 2,804     | 28,032     | 280,319
Baseline        | 30,537 ms | 304,474 ms | 3,061,018 ms
Optimized       | 126 ms    | 318 ms     | 3,110 ms
Loading commits | 15.15 ms  | 162.5 ms   | 2,352 ms
Creating plan   | 0.65 ms   | 11.6 ms    | 178 ms
Loading changes | 14.87 ms  | 15.6 ms    | 16 ms
Traversal       | 0.39 ms   | 4.7 ms     | 60 ms
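The Query 1 figures above translate into speedups of roughly 242x, 957x and 984x for the three datasets. The short calculation below reproduces that arithmetic from the reported baseline and optimized times; the variable names are chosen for this example.

```python
# Reported Query 1 response times in milliseconds, per number of commits.
COMMITS = (2_804, 28_032, 280_319)
BASELINE_MS = (30_537, 304_474, 3_061_018)
OPTIMIZED_MS = (126, 318, 3_110)

# Speedup = baseline time divided by optimized time.
speedups = {
    commits: baseline / optimized
    for commits, baseline, optimized in zip(COMMITS, BASELINE_MS, OPTIMIZED_MS)
}

for commits, factor in speedups.items():
    print(f"{commits:>7} commits: about {factor:.0f}x faster")
```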
Evaluation - Response Time Query 2 (3,108 results)
#commits        | 2,804     | 28,032     | 280,319
Baseline        | 57,009 ms | 607,924 ms | OutOfMemoryError
Optimized       | 792 ms    | 1,188 ms   | 5,910 ms
Loading commits | 10.95 ms  | 160.6 ms   | 2,326 ms
Creating plan   | 0.60 ms   | 11.9 ms    | 175 ms
Loading changes | 610.57 ms | 616.5 ms   | 649 ms
Traversal       | 13.54 ms  | 160.6 ms   | 2,000 ms
Conclusion and Outlook
• Conclusion: With partial on-demand version reconstruction
and query engine optimization, delta-based storage strategies
can serve targeted queries on datasets with millions of
triples and thousands of versions
• Outlook:
  • Optimization of implementation
  • Evaluation with established benchmarks
  • Evaluation of caching strategies
  • Evaluation with hybrid storage strategies
  • Integration into existing systems (e.g. r43ples)
