This document introduces a software framework and datasets for analyzing graph measures on RDF graphs. The framework includes a processing pipeline to acquire, prepare, and analyze RDF datasets. It calculates 28 graph measures across 5 groups (basic, degree-based, centrality, edge-based, descriptive statistics) on 280 RDF datasets from the LOD Cloud. Preliminary analysis shows variation in measures across domains. The framework and pre-processed datasets are available open-source to support large-scale graph-based analysis of RDF data.
1. A Software Framework and Datasets for the
Analysis of Graph Measures on RDF Graphs
Matthäus Zloch1, Maribel Acosta2, Daniel Hienert1,
Stefan Dietze1,3, Stefan Conrad3
1 GESIS - Leibniz-Institute for the Social Sciences, Germany
2 Karlsruhe Institute of Technology, Germany
3 Institute for Computer Science, Heinrich-Heine University, Germany
2. Motivation
Studying graph topologies is relevant because
• the availability and linkage of RDF datasets grow
• various research areas rely on meaningful statistics and measures
We want to study the topology of RDF graphs
• not at instance- or schema-level
• but the implicit data structure of RDF data graphs
Why studying graph topologies is relevant
3. Graph-based model of RDF
[Figure: an RDF triple (s, p, o) modeled as a directed graph, with subjects and objects as vertices and each predicate p as a directed, labelled edge from s to o]
Measures that can be computed on this model:
- # vertices and # edges
- # parallel edges
- density or reciprocity
- degree-based measures
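The graph-based model above can be illustrated in a few lines. This is a minimal sketch, not the framework's code; the triples (`ex:alice`, `foaf:knows`, etc.) are hypothetical. Subjects and objects become vertices, each triple becomes one directed edge labelled with the predicate, and two of the basic measures fall out directly:

```python
# Minimal sketch: RDF triples as a directed, edge-labelled graph.
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:knows", "ex:bob"),   # parallel edge: same s and o
    ("ex:bob",   "foaf:name",  '"Bob"'),
]

# Subjects and objects become vertices; predicates label the edges.
vertices = {s for s, _, _ in triples} | {o for _, _, o in triples}
edges = [(s, o, p) for s, p, o in triples]

n, m = len(vertices), len(edges)
unique_edges = len({(s, o) for s, o, _ in edges})
parallel_edges = m - unique_edges

print(n, m, parallel_edges)  # 3 vertices, 3 edges, 1 parallel edge
```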
4. Research areas that may benefit
Benchmarking – Designers may use the measures to generate more representative synthetic datasets
Sampling – more representative samples in terms of the graph structure
Profiling and Evolution – monitoring the change in structure over time (influence vs. prominence)
Why studying graph topologies is relevant
7. Resource Paper Contribution
Our paper introduces two resources:
1. An open-source framework to acquire, prepare, and perform analyses of graph-based measures on RDF graphs [1]
2. A dataset of 280 RDF datasets from the LOD Cloud (late 2017), pre-processed and ready to be re-used; a browsable version is available [2]
[1] https://github.com/mazlo/lodcc
[2] https://data.gesis.org/lodcc/2017-08
8. Framework’s Processing Pipeline
How to acquire, prepare, and perform a graph-based analysis on RDF
[3] Debattista, J., Lange, C., Auer, S., & Cortis, D. (2018). Evaluating the quality of the LOD cloud: An empirical investigation. Semantic Web, 9, 859-901. DOI 10.3233/SW-180306
9. Dataset Metadata Preparation
Optional: preparation of an offline list of all datasets, e.g. for parallel processing.
The list should contain all dataset names, the (official) media type format with URLs, the domain class, and the modification date.
10. Graph-Object Preparation
Downloads the dump, then extracts*, transforms*, and groups* the RDF files
The N-Triples format is used to transform the data into an edgelist structure
* if necessary
11. Graph-Object Preparation
As N-Triples:
<http://../dataset/whisky-circle-info> <http://..title> "Whisky Circle"@en .
[Figure: the triple (s, p, o) as a directed edge from s to o, labelled p]
12. Graph-Object Preparation
As N-Triples, using a non-cryptographic hashing function to "encode" the data [3]:
<http://../dataset/whisky-circle-info> <http://..title> "Whisky Circle"@en .
43f2f4f2e41ae099 c9643559faeed68e 02325f53aeba2f02
[3] xxhash, https://github.com/Cyan4973/xxHash
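The encoding step can be sketched as follows. The framework uses xxhash's 64-bit hash [3]; because that is a third-party package, this self-contained sketch substitutes Python's built-in `hashlib.blake2b` with an 8-byte digest as a stand-in. The token values therefore differ from the slide, but the relevant properties are the same: each term maps to a fixed 16-hex-character token, deterministically across datasets.

```python
import hashlib

def encode(term: str) -> str:
    # Stand-in for xxhash's xxh64: any fast 64-bit hash yields a fixed
    # 16-hex-character token per term. (The framework uses xxhash [3];
    # blake2b is used here only to keep the sketch dependency-free.)
    return hashlib.blake2b(term.encode("utf-8"), digest_size=8).hexdigest()

s = "<http://../dataset/whisky-circle-info>"
p = "<http://..title>"
o = '"Whisky Circle"@en'

print(encode(s), encode(p), encode(o))  # three 16-character tokens
```

Because the same URL always hashes to the same token, the encoded graphs stay comparable across datasets while storing far fewer characters.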
13. Graph-Object Preparation
As N-Triples (s, p, o):
<http://../dataset/whisky-circle-info> <http://..title> "Whisky Circle"@en .
43f2f4f2e41ae099 c9643559faeed68e 02325f53aeba2f02
As edgelist (source vertex, target vertex, edge-property):
43f2f4f2e41ae099 02325f53aeba2f02 c9643559faeed68e
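The reordering step above can be sketched like this (again with a stdlib 64-bit hash standing in for xxhash, so the tokens differ from the slide's): the object becomes the target vertex and the predicate is kept as an edge property.

```python
import hashlib

def h(term: str) -> str:
    # Stand-in for the 64-bit xxhash used by the framework.
    return hashlib.blake2b(term.encode("utf-8"), digest_size=8).hexdigest()

def triple_to_edgelist_line(s: str, p: str, o: str) -> str:
    # Reorder (s, p, o) to (source, target, edge-property): the object
    # becomes the target vertex, the predicate an edge property.
    return f"{h(s)} {h(o)} {h(p)}"

line = triple_to_edgelist_line(
    "<http://../dataset/whisky-circle-info>",
    "<http://..title>",
    '"Whisky Circle"@en',
)
print(line)
```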
14. Graph-Object Instantiation
Reads the edgelist and builds the graph structure
Reports results on measures from 5 dimensions
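A minimal plain-Python sketch of this step. The framework itself instantiates a graph-tool graph object; here a toy edgelist illustrates how some of the reported measures follow directly from the structure:

```python
from collections import Counter

# Toy edgelist: (source vertex, target vertex, edge property).
edgelist = [
    ("a", "b", "p1"),
    ("a", "b", "p2"),   # parallel edge between a and b
    ("b", "c", "p1"),
    ("c", "a", "p3"),
]

vertices = {v for s, t, _ in edgelist for v in (s, t)}
n, m = len(vertices), len(edgelist)

out_deg = Counter(s for s, _, _ in edgelist)
in_deg = Counter(t for _, t, _ in edgelist)

avg_degree = m / n              # z: average edges per vertex
density = m / (n * (n - 1))     # directed-graph density
max_out = max(out_deg.values())
max_in = max(in_deg.values())

print(n, m, max_in, max_out)
```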
15. Library re-use
The pipeline re-uses existing tools at each stage: metadata acquisition [4], download [5], extraction, transformation, and hashing [6, 7, 8], and graph instantiation [9].
[4] https://old.datahub.io/dataset/<dataset-name>/datapackage.json
[5] Wget, https://www.gnu.org/software/wget/
[6] dtrx, https://github.com/moonpyk/dtrx
[7] rapper, http://librdf.org/raptor/rapper.html
[8] xxhash, https://github.com/Cyan4973/xxHash
[9] graph-tool, https://graph-tool.skewed.de/
16. Groups of Measures
The framework reports on 28 measures from 5 groups:
Basic graph measures
• no. of vertices, edges
• parallel edges
• unique edges
Degree-based measures
• max-[in|out]-degree
• average degree
• h-index (directed/undirected)
Centrality measures
• graph centralization
• max degree centrality
Edge-based measures
• density
• reciprocity
• diameter
Descriptive statistical measures
• variance, standard deviation, coefficient of variation
• degree distribution, powerlaw exponent alpha
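As a concrete example from the degree-based group, the h-index of a graph can be computed from its degree sequence, analogous to the h-index in citation analysis: the largest h such that at least h vertices have degree >= h. A minimal sketch, not the framework's exact implementation:

```python
def h_index(degrees):
    # Largest h such that at least h vertices have degree >= h.
    ds = sorted(degrees, reverse=True)
    h = 0
    for i, d in enumerate(ds, start=1):
        if d >= i:
            h = i
        else:
            break
    return h

print(h_index([3, 3, 2, 1]))  # -> 2
```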
22. Datasets from 9 domains in the LOD Cloud
12 in May 2007
570 in August 2014
1163 in August 2017
1224 in August 2018
1239 in March 2019
A Dataset of Pre-Processed RDF Graphs
23. A Dataset of Pre-Processed RDF Graphs
A total of 280 RDF datasets processed and analyzed
Values for 28 measures per dataset
Graph-objects ready to be re-used, results as CSV, and the original link to the metadata
Case Study with Datasets from LOD Cloud
Available at our website https://data.gesis.org/lodcc/2017-08
24. Graph-based Analysis at Large Scale
To analyze RDF graphs at large scale, you have to
• download the list of available datasets
• acquire the datasets
• represent each as a graph-object
• compute graph measures on it
Sounds easy, right?
25. Graph-based Analysis at Large Scale
In reality it is not that easy:
• not all data providers offer data dumps
• non-standard media type declarations
• various formats, compressed archives, hierarchies of files and folders
• erroneous/error-prone data
26. Acquisition and Preparation
1163 – metadata packages
890 – datasets with URLs, after mapping 150 different media type statements to the supported official ones
486 – after filtering out HTTP 404 responses and content-type HTML
280 – after leaving out SPARQL endpoints, and after graph preparation (corrupt downloads, wrong media type statements, syntax errors)
29. Average degree z seems not to be affected by the number of edges, in all domains but Geography and Government
Average edges per vertex:
Life Sciences: 63.50
Cross Domain: 5.46
Average over all domains: 7.9 edges per vertex
Preliminary Analysis of Results
30. Preliminary Analysis of Results
hd grows exponentially with the number of edges
Life Sciences and Government are more “dense”
Linguistics forms two clusters, with almost no dependency on the number of edges, low on average
31. Availability, Maintenance, Sustainability
The framework
• published under the MIT license on GitHub: https://github.com/mazlo/lodcc
• actively used in other research activities
• future releases (minor, bugfixes, features)
The datasets
• recalculated for newer versions of the LOD Cloud
• made available to the community
• combined with other datasets, e.g. http://stats.lod2.eu
32. Future Work and Research
Investigate domain- and dataset-specific irregularities
Derive implications for modelling tasks, at dataset level and for applications like benchmarking
Offer a SPARQL endpoint to query the results
33. Thank you for your attention
[1] https://github.com/mazlo/lodcc
[2] https://data.gesis.org/lodcc/2017-08
@matzlo
Speaker notes
Our motivation is the study of graph topologies, which is interesting because the availability and linkage of RDF datasets grow. As this number rises we need to collect meaningful statistics and measures to describe the data.
Many approaches collect statistics mostly at instance- and schema-level, but not necessarily from the data structure that an RDF dataset comes with: the RDF data graph.
Various research areas rely on statistics and measures, e.g. data-driven tasks like query processing, studies on the quality of data sets, monitoring services of the evolution of the space, are some examples.
The implicit data structure that we get from a set of RDF triples composes a directed and labelled graph, where subjects and objects can be defined as vertices while predicates correspond to edges. So, when we build a graph-object from this, we will be able to compute various measures.
We can think of various research areas that may benefit from such analyses. For instance,
BENCHMARKING
Benchmark suites e.g. aim at designing a simulation of a real-world scenario, in that they have a synthetic dataset generator and common queries.
If we look closer, we can see that
benchmark datasets interpret growth in terms of the number of edges and the max in-degree, not the max out-degree,
the density of the graph shrinks, and
some have no reciprocity.
SAMPLING
Almost the same applies to sampling methods where here research aims at delivering a representative sample from an original dataset. Example questions that arise in this field: What does representative mean? How to obtain a (minimal) representative sample? Which method to use?
Apart from qualitative aspects, like classes, properties, instances, and used vocabularies, also topological characteristics should be considered, since they allow for a more accurate description of the dataset. This applies to all graph-based datasets, and is not a LD/RDF-specific issue.
PROFILING
With the growing number of datasets in the LOD Cloud, the linkage and connectivity is of particular interest. Graph measures may help to monitor changes and the impact of changes in datasets or even domains over time.
First…
And second, a dataset of 280 RDF datasets from LOD Cloud late 2017, that we processed with the framework. As part of the resource is a website that presents these results for all of the datasets. The datasets are pre-processed and ready to be re-used by you.
In the next slides I am going to present you
how the framework works,
how we did the case study on the LOD Cloud late 2017, and
a report on a preliminary analysis of these results over all domains.
This is how the framework works.
To be able to instantiate a graph-object from an RDF dataset, we have come up with this pipeline. This can also be found in related work and in other studies.
…
The first two steps are optional. First, the corresponding metadata is loaded from datahub, parsed for media types, and saved into a local database. This is advisable for parallel processing, which is highly recommended when you have many datasets. The framework can work with both a database connection and command-line arguments if you have no database.
The list should contain all names, media type statements with URLs, the domain affiliation, and the modification date.
The framework is currently limited to working with official media types for the most common formats of RDF data, which are N-Triples, RDF/XML, Turtle, N-Quads, and Notation3.
Dumps then get downloaded, extracted, transformed, and grouped in the case of archives with multiple files. In order to build a graph-object, one can use an edgelist, which is a list of source and target vertices per line. That is why the N-Triples format is very handy and why we need the transformation procedure.
Here is an example of a statement transformed into N-Triples format.
An issue with N-Triples, however, is that it adds a lot of boilerplate text, because a lot of information gets repeated, mainly the URLs, and so the graph-objects get large, both on hard disk and in memory.
Therefore we used a non-cryptographic hashing function to “encode” each part of the triple.
Encoding data in such a form has many advantages, e.g.
it saves memory, as on average only 20% of the characters have to be stored,
and it makes graphs comparable in terms of contained vertices and edges, because the hash for a URL in dataset 1 will be the same in dataset 2.
To build up the edgelist, we just swapped the positions of the O and the P, making the P an additional property of the edge in the graph-object.
In the last steps, the graph-object is built from the edgelist and the measures are calculated.
The framework can be configured to run in parallel. How long it takes to complete depends on your network connection, CPU cores, and hard disk I/O.
The framework computes 28 graph-based measures which can be grouped into 5 groups. Here are some examples.
BASIC : number of vertices and edges, parallel edges, unique edges.
DEGREE-BASED : max-(in,out)-degree, avg degree, h-index (directed and undirected)
CENTRALITY : graph centralization, max CD
EDGE-BASED : density (ratio all edges to all possible edges), reciprocity, diameter
DESCRIPTIVE : variance, std dev, coefficient of variation, degree-distribution and powerlaw-exponent alpha
This does not necessarily show how well our framework works, but rather how well the underlying libraries that we are using work, e.g. dtrx, rapper, and graph-tool.
Now I will come to the second part, which is the description of datasets that we have processed with the framework.
We thought there is no better stress test for our framework than datasets from the LOD Cloud. From a theoretically available number of around 1200 datasets, we managed to analyze 280 datasets from the LOD Cloud of late 2017; I will tell you why in a moment. This is the second resource that we publish with the paper.
So this resource contains all 280 datasets that we have processed and analyzed with the framework. We got values for all 28 measures per dataset and created a website to be able to browse the results.
For each of the datasets, you can download the initial metadata that was used to acquire the dump, all results as a CSV file export, and a serialized graph-object that you can re-use for further analysis. All of this is available at the website.
The main benefit of this collection is that each RDF dataset is already prepared. This makes it possible to reproduce the results and to perform further analysis of graph measures on the graphs from scratch, without further preparation.
For all datasets we also provide plots, e.g. for the distribution of the degree.
This is how we did the analysis. To analyze RDF graphs at large scale, in terms of dataset size and dataset quantity, you would have to …
But in reality, not all data providers offer data dumps. And for those that offer dumps, you frequently have to deal with non-standard (wrong) media type declarations.
Providers use different formats, some compress their files with different algorithms and some will give you a hierarchy of files and folders including non-RDF data.
In addition, you will have erroneous and error-prone data, like syntax errors etc.
At first we had all metadata packages at hand. After parsing those, we got 150 different media type statements. Since the framework accepts only official media type statements of the most common media types, we mapped them manually.
After this mapping we got 890 datasets with URLs. Further, we filtered out HTTP 404 codes and HTML content types. These were the manual steps in the framework's processing pipeline presented earlier.
Furthermore, we concentrated on data dumps, not SPARQL endpoints, so as not to stress them.
This is a snapshot of the website. On the left side you can see the distribution of the datasets per domain for which we were able to do the analysis. Unfortunately, some of them are not well represented.
However, the largest dataset is in the Cross Domain, which is en-dbpedia with 2.6B edges. Most datasets are in the Linguistics and Publications domains.
We did a preliminary analysis of the measures across all domains and could observe dataset- and domain-specific phenomena. I would like to show you two measures, average degree and h-index on the directed graph, which we have plotted across all domains.
Avg. degree is a frequently consulted measure and gives you the average number of edges that vertices have in the graph object. In this plot you can see the datasets and avg. degree values for 5 domains.
The datasets are ordered descending by number of edges
When you look at the plot you can see that avg. degree seems not to be affected by number of edges, in all but GE and GO. GE and GO report an increasing linear relationship.
Outliers can be observed among all domains, like bio2rdf-irefindex in Life Sciences with 63.
In the Linguistics domain there seem to be two clusters, with one group having higher values than the other. This may be considered a dataset-specific phenomenon, most probably caused by the fact that either
- one data provider used a specific vocabulary and used to model more accurately (more predicates on average), OR
- there were two different providers publishing a lot of small datasets of different kind.
Unfortunately, this may not necessarily be representative, because not all datasets were included.
The second measure that I've plotted here across all domains is the h-index, which is known from citation networks.
There it is an indicator for the importance of a vertex in a network; here it is a statistical measure on the whole graph.
Each dot in the figure is a dataset.
The datasets are ordered descending by number of edges
y-axis is log-scaled
It grows exponentially with the size of the graph. Government and Life Sciences report higher values and could be considered more "dense"; Publications reports lower values. Again, Linguistics shows two clusters with almost constant values that seem to be independent of the number of edges in most cases, in particular for the lower group of datasets.
Regarding future work, we would like to investigate the domain- and dataset-specific irregularities: where do they come from, what is the reason, etc., and derive implications for modelling tasks, at the dataset- and application-specific level, like benchmarking.