A Software Framework and Datasets for the
Analysis of Graph Measures on RDF Graphs
Matthäus Zloch1, Maribel Acosta2, Daniel Hienert1,
Stefan Dietze1,3, Stefan Conrad3
1 GESIS - Leibniz-Institute for the Social Sciences, Germany
2 Karlsruhe Institute of Technology, Germany
3 Institute for Computer Science, Heinrich-Heine University, Germany
Motivation
Studying graph topologies is relevant because
• the availability and linkage of RDF datasets are growing
• various research areas rely on meaningful statistics and measures
We want to study the topology of RDF graphs
• not at the instance or schema level
• but at the level of the implicit data structure of RDF data graphs
Why studying graph topologies is relevant
Graph-based model of RDF
- # vertices and # edges
- # parallel edges
- density or reciprocity
- degree-based measures
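[Figure: RDF triples (s, p, o) drawn as a directed graph, with subjects s and objects o as vertices and predicates p as edge labels]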
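As a minimal sketch of this graph-based model, the snippet below maps a handful of triples to directed, labeled edges and counts vertices, edges, and parallel edges. The sample triples and the helper name `triples_to_edges` are made up for illustration and are not part of the framework.

```python
from collections import Counter

# Hypothetical sample triples (s, p, o); all names are placeholders.
TRIPLES = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:worksWith", "ex:bob"),  # same s and o, different p
    ("ex:bob", "ex:knows", "ex:alice"),
    ("ex:alice", "ex:name", '"Alice"'),
]


def triples_to_edges(triples):
    """Map each triple (s, p, o) to a directed edge s -> o labeled with p."""
    return [(s, o, p) for s, p, o in triples]


edges = triples_to_edges(TRIPLES)
vertices = {v for s, o, _ in edges for v in (s, o)}

# Edges that share the same source and target are parallel edges.
pair_counts = Counter((s, o) for s, o, _ in edges)
parallel_edges = sum(count - 1 for count in pair_counts.values())

print(len(vertices), len(edges), parallel_edges)
```

Two triples with the same subject and object but different predicates become parallel edges, which is why the model is a multigraph rather than a simple graph.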
Research areas that may benefit
• Benchmarking – designers may use the measures to generate more representative synthetic datasets
• Sampling – more representative samples in terms of the structure
• Profiling and Evolution – monitor the change in structure over time (influence vs. prominence)
Resource Paper Contribution
Our paper introduces two resources
1. An open source framework to acquire, prepare, and perform analyses of graph-based measures on RDF graphs [1]
2. A collection of 280 RDF datasets from the LOD Cloud (late 2017), pre-processed and ready to be re-used. A browsable version is available [2]
[1] https://github.com/mazlo/lodcc
[2] https://data.gesis.org/lodcc/2017-08
Framework’s Processing Pipeline
How to acquire, prepare, and perform a graph-based analysis on RDF
[3] Debattista, J., Lange, C., Auer, S. & Cortis, D. (2018). Evaluating the quality of the LOD cloud: An empirical investigation. Semantic Web, 9, 859-901. DOI 10.3233/SW-180306
Dataset’s Metadata Preparation
• Optional: prepare an offline list of all datasets, e.g. for parallel processing.
• The list should contain, for each dataset, its name, the (official) media type formats with their URLs, the domain class, and the modification date.
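One way to picture such an offline list is sketched below. The class `DatasetEntry`, its field names, the helper `chunk`, and every value in the example entries are hypothetical placeholders; the framework's actual storage format is not shown on the slides.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DatasetEntry:
    """One row of the offline dataset list (field names are assumptions)."""
    name: str
    domain: str
    modified: date
    # Maps an official media type to the URL of the corresponding dump.
    urls: dict = field(default_factory=dict)


def chunk(entries, workers):
    """Split the list round-robin so that each worker gets its own share."""
    return [entries[i::workers] for i in range(workers)]


# Placeholder entries; names, dates, and URLs are invented for illustration.
entries = [
    DatasetEntry("dataset-a", "life_sciences", date(2017, 1, 1),
                 {"application/rdf+xml": "http://example.org/a.rdf"}),
    DatasetEntry("dataset-b", "government", date(2017, 1, 2),
                 {"text/turtle": "http://example.org/b.ttl"}),
    DatasetEntry("dataset-c", "linguistics", date(2017, 1, 3),
                 {"application/n-triples": "http://example.org/c.nt"}),
]

for worker_id, share in enumerate(chunk(entries, workers=2)):
    print(worker_id, [entry.name for entry in share])
```

Keeping the list offline means each worker can process its share without querying the metadata source again.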
Graph-Object Preparation
• Downloads the dump, then extracts*, transforms*, and groups* the RDF files
• The N-Triples format is used as the basis for the transformation into an edgelist structure
* if necessary
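The download and transform steps could be driven by command lines such as the ones assembled below. This is only a sketch: the function `prepare_commands`, the media type table, and the `.dump` file naming are assumptions, archive extraction is left out, and the commands are built but not executed.

```python
import shlex

# Media type -> rapper parser name; a small, assumed subset.
RAPPER_FORMATS = {
    "application/rdf+xml": "rdfxml",
    "text/turtle": "turtle",
    "application/n-triples": "ntriples",
}


def prepare_commands(url, media_type, basename):
    """Return the command lines for downloading and transforming one dump."""
    dump = f"{basename}.dump"
    download = ["wget", "-O", dump, url]
    if media_type == "application/n-triples":
        # Already N-Triples: no transformation necessary.
        return [download]
    parser = RAPPER_FORMATS.get(media_type, "guess")
    # rapper writes N-Triples to standard output; redirect it to a .nt file.
    transform = ["rapper", "-i", parser, "-o", "ntriples", dump]
    return [download, transform]


for command in prepare_commands("http://example.org/d.ttl", "text/turtle", "d"):
    print(shlex.join(command))
```

Skipping the transformation for dumps that are already N-Triples mirrors the "if necessary" note on the slide.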
• As N-Triples
<http://../dataset/whisky-circle-info> <http://..title> "Whisky Circle"@en .
• Use a non-cryptographic hash function to "encode" the data [3]
43f2f4f2e41ae099 c9643559faeed68e 02325f53aeba2f02
[3] xxhash, https://github.com/Cyan4973/xxHash
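The encoding step can be sketched as follows. Two things are assumptions here: the naive statement splitter `split_ntriple`, and the use of `hashlib.blake2b` with an 8-byte digest as a standard-library stand-in for xxhash. BLAKE2b is a cryptographic hash, so it is slower and yields different values than the ones shown on the slide; it only illustrates the idea of replacing each term with a fixed-width 64-bit key. The example IRIs are placeholders.

```python
import hashlib

# Placeholder statement; the IRIs are invented for illustration.
LINE = '<http://example.org/dataset/d1> <http://example.org/title> "Whisky Circle"@en .'


def split_ntriple(line):
    """Naively split one N-Triples statement into (s, p, o)."""
    body = line.strip()
    if body.endswith("."):
        body = body[:-1].rstrip()
    # Subject and predicate contain no whitespace; the object may (literals).
    s, p, o = body.split(None, 2)
    return s, p, o


def encode(term):
    """Return a 64-bit digest of the term as 16 hexadecimal characters."""
    return hashlib.blake2b(term.encode("utf-8"), digest_size=8).hexdigest()


def encode_triple(line):
    """Encode subject, predicate, and object of one statement."""
    return tuple(encode(term) for term in split_ntriple(line))


print(" ".join(encode_triple(LINE)))
```

Because equal terms always map to the same key, vertices that occur in many statements are still recognized as the same vertex after encoding.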
• As edgelist
source vertex     target vertex     edge-property
43f2f4f2e41ae099  02325f53aeba2f02  c9643559faeed68e
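The reordering from hashed triple to edgelist row is a simple column swap, sketched here with the values from the slide. The function name `to_edgelist_row` is hypothetical.

```python
def to_edgelist_row(hashed_triple):
    """Reorder 's p o' into 'source target edge-property', i.e. 's o p'."""
    s, p, o = hashed_triple.split()
    return f"{s} {o} {p}"


row = to_edgelist_row("43f2f4f2e41ae099 c9643559faeed68e 02325f53aeba2f02")
print(row)  # 43f2f4f2e41ae099 02325f53aeba2f02 c9643559faeed68e
```

Placing source and target in the first two columns keeps the predicate available as an edge property while letting a graph library read the vertex pair directly.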
Graph-Object Instantiation
• Reads the edgelist and builds the graph structure
• Reports results for measures from 5 groups
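Reading an edgelist into a graph structure could look like the sketch below. It uses plain dictionaries instead of a graph library, so it is a stand-in for, not a copy of, the framework's instantiation step. The first row is the one from the slide; the other rows use invented keys.

```python
from collections import defaultdict

# First row from the slide; the remaining rows are invented placeholders.
EDGELIST = """\
43f2f4f2e41ae099 02325f53aeba2f02 c9643559faeed68e
43f2f4f2e41ae099 0000000000000001 0000000000000002
0000000000000001 43f2f4f2e41ae099 0000000000000002
"""


def read_edgelist(lines):
    """Build out-edge lists and in-degree counts from edgelist rows."""
    out_edges = defaultdict(list)  # source -> [(target, edge_property), ...]
    in_degree = defaultdict(int)   # target -> number of incoming edges
    for line in lines:
        if not line.strip():
            continue
        source, target, prop = line.split()
        out_edges[source].append((target, prop))
        in_degree[target] += 1
    return out_edges, in_degree


out_edges, in_degree = read_edgelist(EDGELIST.splitlines())
vertices = set(out_edges) | set(in_degree)
num_edges = sum(len(targets) for targets in out_edges.values())
print(len(vertices), num_edges)
```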
Library re-use
[4] https://old.datahub.io/dataset/<dataset-name>/datapackage.json
[5] Wget, https://www.gnu.org/software/wget/
[6] dtrx, https://github.com/moonpyk/dtrx
[7] rapper, http://librdf.org/raptor/rapper.html
[8] xxhash, https://github.com/Cyan4973/xxHash
[9] graph-tool, https://graph-tool.skewed.de/
[Figure: processing pipeline annotated with the libraries used at each step: [4], [5], [6, 7, 8], [9]]
Groups of Measures
Framework reports on 28 measures from 5 groups
Basic graph measures
• no. of vertices, edges
• parallel edges
• unique edges
Degree-based measures
• max-[in|out]-degree
• average degree
• h-index (directed/undirected)
Centrality measures
• graph centralization
• max degree centrality
Edge-based measures
• density
• reciprocity
• diameter
Descriptive statistical measures
• variance, standard deviation, coefficient of variation
• degree distribution, power-law exponent alpha
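A few of these measures can be illustrated on a toy graph with the standard library alone. The definitions used below are common textbook ones and are assumptions in this context: density as m / (n(n-1)) for a directed graph, reciprocity as the share of unique edges whose reverse edge exists, and the h-index as the largest h such that at least h vertices have total degree of at least h. The framework's exact formulas may differ.

```python
from collections import Counter
from statistics import mean, pstdev, pvariance


def graph_measures(edges):
    """Compute selected measures for a non-empty directed multigraph.

    edges: list of (source, target) pairs; repeated pairs are parallel edges.
    """
    vertices = {v for edge in edges for v in edge}
    n, m = len(vertices), len(edges)
    unique = set(edges)

    out_deg = Counter(s for s, _ in edges)
    in_deg = Counter(t for _, t in edges)
    degree = [out_deg[v] + in_deg[v] for v in vertices]

    # h-index: largest h such that at least h vertices have degree >= h.
    ranked = sorted(degree, reverse=True)
    h_index = sum(1 for rank, d in enumerate(ranked, start=1) if d >= rank)

    # Reciprocity: share of unique edges (u, v) whose reverse (v, u) exists.
    reciprocal = sum(1 for s, t in unique if s != t and (t, s) in unique)

    return {
        "max_in_degree": max(in_deg.values()),
        "max_out_degree": max(out_deg.values()),
        "avg_edges_per_vertex": m / n,
        "h_index": h_index,
        "density": m / (n * (n - 1)) if n > 1 else 0.0,
        "reciprocity": reciprocal / len(unique),
        "variance": pvariance(degree),
        "std_dev": pstdev(degree),
        "coeff_of_variation": pstdev(degree) / mean(degree),
    }


EDGES = [("a", "b"), ("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")]
for name, value in graph_measures(EDGES).items():
    print(f"{name}: {value:.3f}")
```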
Performance
Example: datasets and sizes
Example: runtimes
A Dataset of Pre-Processed RDF Graphs
Datasets from 9 domains in the LOD Cloud
• 12 in May 2007
• 570 in August 2014
• 1163 in August 2017
• 1224 in August 2018
• 1239 in March 2019
• A total of 280 RDF datasets processed and analyzed
• Values for 28 measures per dataset
• Graph objects ready to be re-used, results as CSV, and a link to the original metadata
Available at our website: https://data.gesis.org/lodcc/2017-08
Case Study with Datasets from LOD Cloud
Graph-based Analysis at large scale
To analyze RDF graphs at large scale, you have to
• download the list of available datasets
• acquire the datasets
• represent each one as a graph object
• compute graph measures on it
Sounds easy, right?
In reality, it is not that easy:
• not all data providers offer data dumps
• non-standard media type declarations
• various formats, compressed archives, hierarchies of files and folders
• erroneous/error-prone data
Acquisition and Preparation
1163
• metadata packages
890
• 150 different media type statements
• URLs for the official media type statements that are supported
486
• after filtering out HTTP 404 responses and content type HTML
280
• SPARQL endpoints left out
• after graph preparation, which excluded corrupt downloads, wrong media type statements, and syntax errors
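The shrinkage across these stages can be made explicit with a little arithmetic. The stage counts are taken from the slide; the short stage labels and the helper `retention` are paraphrases introduced for this sketch.

```python
# Stage counts from the slide; the labels are shortened paraphrases.
STAGES = [
    ("metadata packages", 1163),
    ("supported official media types", 890),
    ("after filtering 404 and HTML responses", 486),
    ("after graph preparation", 280),
]


def retention(stages):
    """Return (label, count, share of start, share of previous stage)."""
    total = stages[0][1]
    previous = total
    rows = []
    for label, count in stages:
        rows.append((label, count, count / total, count / previous))
        previous = count
    return rows


for label, count, of_total, of_previous in retention(STAGES):
    print(f"{count:5d}  {of_total:6.1%} of start  {of_previous:6.1%} of previous  {label}")
```

Roughly a quarter of the listed datasets make it through to a usable graph object, which is the practical cost of the obstacles listed on the previous slide.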
Processed Datasets by Domain
Preliminary Analysis of Results
• Average degree z seems not to be affected by the number of edges in all domains but Geography and Government
• Average edges per vertex
• Life Sciences: 63.50
• Cross Domain: 5.46
• Average over all domains: 7.9 edges per vertex
• h_d (the h-index of the directed graph) grows exponentially with the number of edges
• Life Sciences and Government are more "dense"
• Linguistics forms two clusters with almost no dependency on the number of edges and low values on average
Availability, Maintenance, Sustainability
The framework
• Framework is published under MIT license on GitHub: https://github.com/mazlo/lodcc
• Actively used in other research activities.
• Future releases (minor, bugfixes, features)
The datasets
• Recalculate for newer versions of the LOD Cloud
• Made available to the community
• Combine with other datasets (http://stats.lod2.eu)
Future Work and Research
• Investigate domain- and dataset-specific irregularities
• Derive implications for modelling tasks, at the dataset level and for applications like benchmarking
• Offer a SPARQL endpoint to query the results
Thank you for your attention
[1] https://github.com/mazlo/lodcc
[2] https://data.gesis.org/lodcc/2017-08
@matzlo

 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 

Analysis of Graph Measures on RDF Graphs

  • 1. A Software Framework and Datasets for the Analysis of Graph Measures on RDF Graphs Matthäus Zloch1, Maribel Acosta2, Daniel Hienert1, Stefan Dietze1,3, Stefan Conrad3 1 GESIS - Leibniz-Institute for the Social Sciences, Germany 2 Karlsruhe Institute of Technology, Germany 3 Institute for Computer Science, Heinrich-Heine University, Germany
  • 2. Motivation. Studying graph topologies is relevant because: availability and linkage of RDF data sets grow; various research areas rely on meaningful statistics and measures. We want to study the topology of RDF graphs: not at instance- or schema-level, but about the implicit data structure on RDF data graphs. (Section: Why studying graph topologies is relevant)
  • 3. Graph-based model of RDF. A triple (s, p, o) is modelled as an edge p from vertex s to vertex o. Measures: # vertices and # edges; # parallel edges; density or reciprocity; degree-based measures. [Figure: example RDF graph with vertices s, o and edges p]
  • 4. Research areas that may benefit. Benchmarking – Designers may use the measures to generate more representative synthetic datasets.
  • 5. Research areas that may benefit. Benchmarking – Designers may use the measures to generate more representative synthetic datasets. Sampling – more representative samples in terms of the structure.
  • 6. Research areas that may benefit. Benchmarking – Designers may use the measures to generate more representative synthetic datasets. Sampling – more representative samples in terms of the structure. Profiling and Evolution – monitor the change in structure over time (influence vs. prominence).
  • 7. Resource Paper Contribution. Our paper introduces two resources: 1. An open source framework to acquire, prepare, and perform analyses of graph-based measures on RDF graphs [1]. 2. A dataset of 280 RDF datasets from the LOD Cloud late 2017, pre-processed and ready to be re-used. Browsable version available [2]. [1] https://github.com/mazlo/lodcc [2] https://data.gesis.org/lodcc/2017-08
  • 8. Framework’s Processing Pipeline. (Section: How to acquire, prepare, and perform a graph-based analysis on RDF) [3] Debattista, J., Lange, C., Auer, S. & Cortis, D. (2018). Evaluating the quality of the LOD cloud: An empirical investigation. Semantic Web, 9, 859-901. DOI 10.3233/SW-180306
  • 9. Dataset’s Metadata Preparation. Optional. Preparation of an offline list of all datasets, e.g. for parallel processing. The list should contain all dataset names, the (official) media type format with URLs, domain class, and modification date.
  • 10. Graph-Object Preparation. Downloads the dump, extracts*, transforms*, and groups* RDF files. N-Triples format is used to transform into an edgelist structure. (* if necessary)
  • 11. Graph-Object Preparation. As N-Triples, (s, p, o): <http://../dataset/whisky-circle-info> <http://..title> "Whisky Circle"@en .
  • 12. Graph-Object Preparation. As N-Triples; use non-cryptographic hashing function to “encode” the data [3]. (s, p, o): <http://../dataset/whisky-circle-info> <http://..title> "Whisky Circle"@en . becomes 43f2f4f2e41ae099 c9643559faeed68e 02325f53aeba2f02. [3] xxhash, https://github.com/Cyan4973/xxHash
  • 13. Graph-Object Preparation. As N-Triples, (s, p, o): 43f2f4f2e41ae099 c9643559faeed68e 02325f53aeba2f02. As edgelist, (source vertex, target vertex, edge-property): 43f2f4f2e41ae099 02325f53aeba2f02 c9643559faeed68e.
  • 14. Graph-Object Instantiation. Reads edgelist and builds graph structure. Reports results on measures from 5 dimensions.
  • 15. Library re-use. [Figure: pipeline steps annotated with the libraries they re-use] [4] https://old.datahub.io/dataset/<dataset-name>/datapackage.json [5] Wget, https://www.gnu.org/software/wget/ [6] dtrx, https://github.com/moonpyk/dtrx [7] rapper, http://librdf.org/raptor/rapper.html [8] xxhash, https://github.com/Cyan4973/xxHash [9] graph-tool, https://graph-tool.skewed.de/
  • 16. Groups of Measures. Framework reports on 28 measures from 5 groups. Basic graph measures: no. of vertices, edges; parallel edges; unique edges. Degree-based measures: max-[in|out]-degree; average degree; h-index (direct./undirect.). Centrality measures: graph centralization; max degree centrality.
  • 17. Groups of Measures. Framework reports on 28 measures from 5 groups. Basic graph measures: no. of vertices, edges; parallel edges; unique edges. Degree-based measures: max-[in|out]-degree; average degree; h-index (direct./undirect.). Centrality measures: graph centralization; max degree centrality. Edge-based measures: density; reciprocity; diameter. Descriptive stat. measures: variance, standard dev., coefficient of var.; degree-distribution, powerlaw-exponent alpha.
  • 18. Performance. Example: datasets and sizes.
  • 19. Performance. Example: datasets and sizes.
  • 20. Performance. Example: datasets and sizes. Example: runtimes.
  • 21. Performance. Example: datasets and sizes. Example: runtimes.
  • 22. A Dataset of Pre-Processed RDF Graphs. Datasets from 9 domains in LOD Cloud: 12 in May 2007; 570 in August 2014; 1163 in August 2017; 1224 in August 2018; 1239 in March 2019.
  • 23. A Dataset of Pre-Processed RDF Graphs. Total of 280 RDF datasets processed and analyzed. Values for 28 measures per dataset. Graph-objects ready to be re-used, results as CSV, and original link to metadata. Available at our website https://data.gesis.org/lodcc/2017-08 (Section: Case Study with Datasets from LOD Cloud)
  • 24. Graph-based Analysis at large scale. To analyze RDF graphs at large scale you have to: download the list of available datasets; acquire the datasets; represent as a graph-object; compute graph measures on that. Sounds easy, right?
  • 25. Graph-based Analysis at large scale. In reality not that easy: not all data providers offer data dumps; non-standard media type declarations; various formats, compressed archives, hierarchies of files and folders; erroneous/error-prone data.
  • 26. Acquisition and Preparation. 1163: metadata packages. 890: 150 different media type statements; URLs for the official media type statements that are supported. 486: after filtering 404 and content-type HTML. 280: left out SPARQL-Endpoints; after graph preparation with corrupt downloads, wrong media type statements, syntax errors.
  • 27. Processed Datasets by Domain.
  • 28. Processed Datasets by Domain.
  • 29. Preliminary Analysis of Results. Average degree z seems not affected by number of edges, in all but Geography and Government. Average edges per vertex: Life Sciences: 63.50; Cross Domain: 5.46. Average over all domains: 7.9 edges per vertex.
  • 30. Preliminary Analysis of Results. h-index on the directed graph (h_d) grows exponentially with number of edges. Life Sciences and Government are more “dense”. Linguistics forms two clusters, almost no dependency on the number of edges, low on avg.
  • 31. Availability, Maintenance, Sustainability. The framework: published under MIT license on GitHub, https://github.com/mazlo/lodcc; actively used in other research activities; future releases (minor, bugfixes, features). The datasets: recalculate for newer versions of the LOD Cloud; made available to the community; combine with other datasets, http://stats.lod2.eu
  • 32. Future Work and Research. Investigate domain- and dataset-specific irregularities. Derive implications for modelling tasks, on dataset level and applications like benchmarking. Offer SPARQL-endpoint to query results.
  • 33. Thank you for your attention [1] https://github.com/mazlo/lodcc [2] https://data.gesis.org/lodcc/2017-08 @matzlo
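
The graph-object preparation on slides 11 to 13 (N-Triples statement, hashed terms, edgelist row) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the framework's code: the slides name xxhash, while the sketch swaps in a hand-written 64-bit FNV-1a hash so that it runs without third-party packages, which means the hex values will differ from those on the slides. The names `fnv1a_64` and `to_edgelist_row`, the simplified regular expression, and the `example.org` URIs are hypothetical.

```python
import re

# FNV-1a (64-bit): a simple non-cryptographic hash, used here only as a
# stand-in for xxhash (assumption: any stable 64-bit hash illustrates the idea).
FNV_OFFSET = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3


def fnv1a_64(text: str) -> str:
    """Return the 64-bit FNV-1a hash of `text` as 16 hex characters."""
    h = FNV_OFFSET
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
    return f"{h:016x}"


# Simplified N-Triples pattern: subject, predicate, then the object up to the
# terminating " ." (a real parser such as rapper handles many more cases).
TRIPLE = re.compile(r"^(\S+)\s+(\S+)\s+(.+?)\s*\.\s*$")


def to_edgelist_row(ntriples_line: str) -> tuple[str, str, str]:
    """Hash (s, p, o) and reorder to (source vertex, target vertex, edge-property)."""
    match = TRIPLE.match(ntriples_line)
    if match is None:
        raise ValueError(f"not an N-Triples statement: {ntriples_line!r}")
    s, p, o = (fnv1a_64(term) for term in match.groups())
    return (s, o, p)  # the predicate moves to the end, as on slide 13


if __name__ == "__main__":
    line = '<http://example.org/dataset/whisky-circle-info> <http://example.org/title> "Whisky Circle"@en .'
    print(" ".join(to_edgelist_row(line)))
```

Because the same term always hashes to the same value, a URL that occurs in two datasets ends up as the same vertex identifier in both edgelists.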

Editor's notes

  1. Our motivation is the study of graph topologies, which is interesting because the availability and linkage of RDF datasets keep growing. As their number rises, we need to collect meaningful statistics and measures to describe the data. Many approaches collect statistics mostly at instance- and schema-level, but not necessarily from the data structure that an RDF dataset comes with, the RDF data graph. Various research areas rely on statistics and measures; examples are data-driven tasks like query processing, studies on the quality of datasets, and services that monitor the evolution of the space.
  2. The implicit data structure that we get from a set of RDF triples composes a directed and labelled graph, where subjects and objects can be defined as vertices while predicates correspond to edges. So, when we build up a graph-object from this we will be able to compute various measures like ..
  3. We can think of various research areas that may benefit from such analyses. For instance, BENCHMARKING. Benchmark suites aim, for example, at simulating a real-world scenario, in that they have a synthetic dataset generator and common queries. If we look closer, we can see that benchmark datasets interpret growth in terms of number of edges and max in-degree, not max out-degree. The density of the graph shrinks. Some have no reciprocity.
  4. SAMPLING. Almost the same applies to sampling methods, where research aims at delivering a representative sample from an original dataset. Example questions that arise in this field: What does representative mean? How to obtain a (minimal) representative sample? Which method to use? Apart from qualitative aspects, like classes, properties, instances, and used vocabularies, topological characteristics should also be considered, since they allow for a more accurate description of the dataset. This applies to all graph-based datasets, and is not an LD/RDF-specific issue.
  5. PROFILING. With the growing number of datasets in the LOD Cloud, the linkage and connectivity are of particular interest. Graph measures may help to monitor changes and the impact of changes in datasets or even domains over time.
  6. First… And second, a dataset of 280 RDF datasets from the LOD Cloud late 2017 that we processed with the framework. Part of the resource is a website that presents these results for all of the datasets. The datasets are pre-processed and ready to be re-used by you. In the next slides I am going to show you how the framework works, how we did the case study on the LOD Cloud late 2017, and a report on a preliminary analysis of these results over all domains.
  7. This is how the framework works. To be able to instantiate a graph-object from an RDF dataset, we have come up with this pipeline. This can also be found in related work and in other studies. …
  8. The first two steps are optional. First, the corresponding metadata will be loaded from datahub, parsed for media types, and saved into a local database. This is advisable for parallel processing, which is highly recommended when you have many datasets. The framework can work with both a database connection and command line arguments if you have no database. The list should contain all names, media type statements with URLs, domain affiliation, and modification date. The framework is currently limited to the official media types of the most common formats for RDF data, which are N-Triples, RDF/XML, Turtle, N-Quads, and Notation3.
  9. Dumps will then get downloaded, extracted, transformed, and grouped in case of archives with multiple files. In order to build a graph-object one can use an edgelist, which is a list of source and target vertices per line. That is why the N-Triples format is very handy and why we need the transformation procedure.
  10. Here is an example of a statement transformed into N-Triples format. An issue with N-Triples, however, is that it adds a lot of boilerplate text, because a lot of information gets repeated, mainly the URLs, and so the graph objects will get large, on hard disk and in memory.
  11. Therefore we used a non-cryptographic hashing function to “encode” each part of the triple. Encoding data in such a form has many advantages. For example, it saves memory, as on average only 20% of the characters have to be stored. It also makes graphs comparable in terms of contained vertices and edges, because a hash for a URL in dataset 1 will be the same in dataset 2.
  12. To build up the edgelist we just swapped the positions of the O and the P, making the P an additional property of the edge that is stored in the graph object.
  13. In the last steps the graph object is built from the edgelist and the measures get calculated. The framework can be configured to run in parallel. How long it takes to complete depends on your network connection, CPU cores, and hard disk IO.
  14. The framework computes 28 graph-based measures which can be grouped into 5 groups. Here are some examples. BASIC : number of vertices and edges, parallel edges, unique edges. DEGREE-BASED : max-(in,out)-degree, avg degree, h-index (directed and undirected) CENTRALITY : graph centralization, max CD EDGE-BASED : density (ratio all edges to all possible edges), reciprocity, diameter DESCRIPTIVE : variance, std dev, coefficient of variation, degree-distribution and powerlaw-exponent alpha
  15. For example BASIC : number of vertices and edges, parallel edges, unique edges. DEGREE-BASED : max-(in,out)-degree, avg degree, h-index (directed and undirected) CENTRALITY : graph centralization, max CD EDGE-BASED : density (ratio all edges to all possible edges), reciprocity, diameter DESCRIPTIVE : variance, std dev, coefficient of variation, degree-distribution and powerlaw-exponent alpha
  16. This does not necessarily show how well our framework works, but rather how well the underlying libraries that we are using work, e.g. dtrx, rapper, and graph-tool.
  17. This does not necessarily show how well our framework works, but rather how well the underlying libraries that we are using work, e.g. dtrx, rapper, and graph-tool.
  18. This does not necessarily show how well our framework works, but rather how well the underlying libraries that we are using work, e.g. dtrx, rapper, and graph-tool.
  19. This does not necessarily show how well our framework works, but rather how well the underlying libraries that we are using work, e.g. dtrx, rapper, and graph-tool.
  20. Now I will come to the second part, which is the description of the datasets that we have processed with the framework. We thought there is no better stress test for our framework than datasets from the LOD Cloud. From a theoretically available number of around 1200 datasets, we managed to analyse 280 datasets from the LOD Cloud late 2017; I will tell you why in a moment. This is the second resource that we publish with the paper.
  21. So this resource contains all 280 datasets that we have processed and analyzed with the framework. We got values for all 28 measures per dataset and created a website to be able to browse the results. For each of the datasets you can download the initial metadata that was used to acquire the dump, all results as a CSV file export, and a serialized graph-object that you can re-use for further analysis. All available at this website. The main benefit of this collection is that each RDF dataset is already prepared. This makes it possible to reproduce the results and to perform further analysis of graph measures on the graphs without further preparation. For all datasets we also provide plots, e.g. for the distribution of the degree.
  22. This is how we did the analysis. To analyze RDF graphs at large scale, in terms of dataset size and dataset quantity, you would have to …
  23. But in reality, not all data providers offer data dumps. And for those that offer dumps, you frequently have to deal with non-standard (wrong) media type declarations. Providers use different formats, some compress their files with different algorithms and some will give you a hierarchy of files and folders including non-RDF data. In addition, you will have erroneous and error-prone data, like syntax errors etc.
  24. At first we had all metadata packages at hand. After parsing those we got 150 different media type statements. Since the framework accepts only official media type statements of the most common media types, we manually mapped them. After this mapping we got 890 datasets with URLs. Further, we filtered out HTTP 404 codes and content-type HTML. These were the manual steps in the processing pipeline of the framework presented earlier. Further, we concentrated on data dumps, not SPARQL endpoints, in order not to stress them.
  25. This is a snapshot of the website. On the left side you can see the distribution of the datasets per domain for which we were able to do the analysis. Unfortunately, some of them are not well represented.
  26. However, the largest dataset is in the Cross Domain, which is en-dbpedia with 2.6B edges. Most datasets are in the Linguistics and Publications domains. We did a preliminary analysis of the measures across all domains and could observe dataset- and domain-specific phenomena. I would like to show you two measures, average degree and h-index on the directed graph, which we have plotted across all domains.
  27. Avg. degree is a frequently consulted measure and gives you the average number of edges that vertices have in the graph object. In this plot you can see the datasets and avg. degree values for 5 domains. The datasets are ordered descending by number of edges. When you look at the plot you can see that avg. degree seems not to be affected by the number of edges, in all but Geography (GE) and Government (GO). GE and GO report an increasing linear relationship. Outliers can be observed among all domains, like bio2rdf-irefindex in Life Sciences with 63. In the domain of Linguistics (LI) there seem to be two clusters, with one group having higher values than the other. This may be considered a dataset-specific phenomenon, most probably caused by the fact that either one data provider used a specific vocabulary and modelled more accurately (more predicates on average), or there were two different providers publishing a lot of small datasets of different kinds. Unfortunately, this may not necessarily be representative, because not all datasets were included.
  28. The second measure that I've plotted here across all domains is the h-index, which is known from citation networks. There it is an indicator for the importance of a vertex in a network. Here it is a statistical measure on the graph. Each dot in the figure is a dataset. The datasets are ordered descending by number of edges, and the y-axis is log-scaled. The h-index grows exponentially with the size of the graph. Government (GO) and Life Sciences (LS) report higher values and could be considered more "dense", Publications (PU) lower values. Again Linguistics shows two clusters with almost constant values that seem to be independent of the number of edges in most cases, in particular for the lower group of datasets.
  29. Regarding future work, we would like to investigate the domain- and dataset-specific irregularities: where do they come from, what is the reason, etc. We would also like to derive implications for modelling tasks, on dataset- and application-specific level, like benchmarking.
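
To make some of the measures from notes 14, 27, and 28 concrete, here is a minimal pure-Python sketch that computes a few of them on a small directed edgelist. It is not the framework's implementation (the slides state that the framework builds on graph-tool). The function names `h_index` and `graph_measures` are hypothetical, and the formulas are common textbook definitions chosen as assumptions: average degree as edges per vertex (m/n), density as m / (n(n-1)), reciprocity as the share of unique edges whose reverse edge also exists, and the directed h-index taken over in-degrees.

```python
from collections import Counter


def h_index(degrees):
    """Largest h such that at least h vertices have degree >= h."""
    ranked = sorted(degrees, reverse=True)
    # In a descending ranking, "degree >= rank" holds for exactly the first h ranks.
    return sum(1 for rank, degree in enumerate(ranked, start=1) if degree >= rank)


def graph_measures(edges):
    """Compute a few measures for a directed multigraph given as (source, target) pairs."""
    edges = list(edges)
    unique = set(edges)
    vertices = {v for edge in edges for v in edge}
    n, m = len(vertices), len(edges)

    in_degree = Counter(target for _, target in edges)
    out_degree = Counter(source for source, _ in edges)
    total_degree = in_degree + out_degree

    # Unique edges (u, v) for which the reverse edge (v, u) is also present.
    reciprocal = sum(1 for (u, v) in unique if (v, u) in unique)

    return {
        "vertices": n,
        "edges": m,
        "unique_edges": len(unique),
        "parallel_edges": m - len(unique),
        "max_in_degree": max(in_degree.values(), default=0),
        "max_out_degree": max(out_degree.values(), default=0),
        "avg_degree": m / n if n else 0.0,                 # assumption: edges per vertex
        "density": m / (n * (n - 1)) if n > 1 else 0.0,    # assumption: directed, no self-loops
        "reciprocity": reciprocal / len(unique) if unique else 0.0,
        "h_index_directed": h_index(in_degree.values()),   # assumption: based on in-degree
        "h_index_undirected": h_index(total_degree.values()),
    }


if __name__ == "__main__":
    example = [("a", "b"), ("b", "a"), ("a", "b"), ("a", "c"), ("c", "d")]
    for name, value in graph_measures(example).items():
        print(f"{name}: {value}")
```

For the example edgelist, the edge (a, b) appears twice, so it counts as one parallel edge, and only the pair a/b is reciprocated.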