This document introduces a software framework and datasets for analyzing graph measures on RDF graphs. The framework includes a processing pipeline to acquire, prepare, and analyze RDF datasets. It calculates 28 graph measures across 5 groups (basic, degree-based, centrality, edge-based, descriptive statistics) on 280 RDF datasets from the LOD Cloud. Preliminary analysis shows variation in measures across domains. The framework and pre-processed datasets are available open-source to support large-scale graph-based analysis of RDF data.
1. A Software Framework and Datasets for the
Analysis of Graph Measures on RDF Graphs
Matthäus Zloch1, Maribel Acosta2, Daniel Hienert1,
Stefan Dietze1,3, Stefan Conrad3
1 GESIS - Leibniz-Institute for the Social Sciences, Germany
2 Karlsruhe Institute of Technology, Germany
3 Institute for Computer Science, Heinrich-Heine University, Germany
2. Motivation
Studying graph topologies is relevant because
• the availability and linkage of RDF datasets grow
• various research areas rely on meaningful statistics and measures
We want to study the topology of RDF graphs
• not at instance- or schema-level
• but the implicit data structure of RDF data graphs
Why studying graph topologies is relevant
3. Graph-based model of RDF
[Figure: an RDF triple (s, p, o) modeled as a directed graph, with subjects and objects as vertices and each predicate p as a directed, labelled edge from s to o]
Measures that can be computed on this model:
- # vertices and # edges
- # parallel edges
- density or reciprocity
- degree-based measures
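The graph-based model above can be illustrated in a few lines. This is a minimal sketch, not the framework's code; the triples (`ex:alice`, `foaf:knows`, etc.) are hypothetical. Subjects and objects become vertices, each triple becomes one directed edge labelled with the predicate, and two of the basic measures fall out directly:

```python
# Minimal sketch: RDF triples as a directed, edge-labelled graph.
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:knows", "ex:bob"),   # parallel edge: same s and o
    ("ex:bob",   "foaf:name",  '"Bob"'),
]

# Subjects and objects become vertices; predicates label the edges.
vertices = {s for s, _, _ in triples} | {o for _, _, o in triples}
edges = [(s, o, p) for s, p, o in triples]

n, m = len(vertices), len(edges)
unique_edges = len({(s, o) for s, o, _ in edges})
parallel_edges = m - unique_edges

print(n, m, parallel_edges)  # 3 vertices, 3 edges, 1 parallel edge
```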
4. Research areas that may benefit
Benchmarking – Designers may use the measures to generate more representative synthetic datasets
Sampling – more representative samples in terms of the graph structure
Profiling and Evolution – monitoring the change in structure over time (influence vs. prominence)
Why studying graph topologies is relevant
7. Resource Paper Contribution
Our paper introduces two resources:
1. An open-source framework to acquire, prepare, and perform analyses of graph-based measures on RDF graphs [1]
2. A dataset of 280 RDF datasets from the LOD Cloud (late 2017), pre-processed and ready to be re-used; a browsable version is available [2]
[1] https://github.com/mazlo/lodcc
[2] https://data.gesis.org/lodcc/2017-08
8. Framework’s Processing Pipeline
How to acquire, prepare, and perform a graph-based analysis on RDF
[3] Debattista, J., Lange, C., Auer, S., & Cortis, D. (2018). Evaluating the quality of the LOD cloud: An empirical investigation. Semantic Web, 9, 859-901. DOI 10.3233/SW-180306
9. Dataset Metadata Preparation
Optional: preparation of an offline list of all datasets, e.g. for parallel processing.
The list should contain all dataset names, the (official) media type format with URLs, the domain class, and the modification date.
10. Graph-Object Preparation
Downloads the dump, then extracts*, transforms*, and groups* the RDF files
The N-Triples format is used to transform the data into an edgelist structure
* if necessary
11. Graph-Object Preparation
As N-Triples:
<http://../dataset/whisky-circle-info> <http://..title> "Whisky Circle"@en .
[Figure: the triple (s, p, o) as a directed edge from s to o, labelled p]
12. Graph-Object Preparation
As N-Triples, using a non-cryptographic hashing function to "encode" the data [3]:
<http://../dataset/whisky-circle-info> <http://..title> "Whisky Circle"@en .
43f2f4f2e41ae099 c9643559faeed68e 02325f53aeba2f02
[3] xxhash, https://github.com/Cyan4973/xxHash
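The encoding step can be sketched as follows. The framework uses xxhash's 64-bit hash [3]; because that is a third-party package, this self-contained sketch substitutes Python's built-in `hashlib.blake2b` with an 8-byte digest as a stand-in. The token values therefore differ from the slide, but the relevant properties are the same: each term maps to a fixed 16-hex-character token, deterministically across datasets.

```python
import hashlib

def encode(term: str) -> str:
    # Stand-in for xxhash's xxh64: any fast 64-bit hash yields a fixed
    # 16-hex-character token per term. (The framework uses xxhash [3];
    # blake2b is used here only to keep the sketch dependency-free.)
    return hashlib.blake2b(term.encode("utf-8"), digest_size=8).hexdigest()

s = "<http://../dataset/whisky-circle-info>"
p = "<http://..title>"
o = '"Whisky Circle"@en'

print(encode(s), encode(p), encode(o))  # three 16-character tokens
```

Because the same URL always hashes to the same token, the encoded graphs stay comparable across datasets while storing far fewer characters.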
13. Graph-Object Preparation
As N-Triples (s, p, o):
<http://../dataset/whisky-circle-info> <http://..title> "Whisky Circle"@en .
43f2f4f2e41ae099 c9643559faeed68e 02325f53aeba2f02
As edgelist (source vertex, target vertex, edge-property):
43f2f4f2e41ae099 02325f53aeba2f02 c9643559faeed68e
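The reordering step above can be sketched like this (again with a stdlib 64-bit hash standing in for xxhash, so the tokens differ from the slide's): the object becomes the target vertex and the predicate is kept as an edge property.

```python
import hashlib

def h(term: str) -> str:
    # Stand-in for the 64-bit xxhash used by the framework.
    return hashlib.blake2b(term.encode("utf-8"), digest_size=8).hexdigest()

def triple_to_edgelist_line(s: str, p: str, o: str) -> str:
    # Reorder (s, p, o) to (source, target, edge-property): the object
    # becomes the target vertex, the predicate an edge property.
    return f"{h(s)} {h(o)} {h(p)}"

line = triple_to_edgelist_line(
    "<http://../dataset/whisky-circle-info>",
    "<http://..title>",
    '"Whisky Circle"@en',
)
print(line)
```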
14. Graph-Object Instantiation
Reads the edgelist and builds the graph structure
Reports results on measures from 5 dimensions
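A minimal plain-Python sketch of this step. The framework itself instantiates a graph-tool graph object; here a toy edgelist illustrates how some of the reported measures follow directly from the structure:

```python
from collections import Counter

# Toy edgelist: (source vertex, target vertex, edge property).
edgelist = [
    ("a", "b", "p1"),
    ("a", "b", "p2"),   # parallel edge between a and b
    ("b", "c", "p1"),
    ("c", "a", "p3"),
]

vertices = {v for s, t, _ in edgelist for v in (s, t)}
n, m = len(vertices), len(edgelist)

out_deg = Counter(s for s, _, _ in edgelist)
in_deg = Counter(t for _, t, _ in edgelist)

avg_degree = m / n              # z: average edges per vertex
density = m / (n * (n - 1))     # directed-graph density
max_out = max(out_deg.values())
max_in = max(in_deg.values())

print(n, m, max_in, max_out)
```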
15. Library re-use
The pipeline re-uses existing tools at each stage: metadata acquisition [4], download [5], extraction, transformation, and hashing [6, 7, 8], and graph instantiation [9].
[4] https://old.datahub.io/dataset/<dataset-name>/datapackage.json
[5] Wget, https://www.gnu.org/software/wget/
[6] dtrx, https://github.com/moonpyk/dtrx
[7] rapper, http://librdf.org/raptor/rapper.html
[8] xxhash, https://github.com/Cyan4973/xxHash
[9] graph-tool, https://graph-tool.skewed.de/
16. Groups of Measures
The framework reports on 28 measures from 5 groups:
Basic graph measures
• no. of vertices, edges
• parallel edges
• unique edges
Degree-based measures
• max-[in|out]-degree
• average degree
• h-index (directed/undirected)
Centrality measures
• graph centralization
• max degree centrality
Edge-based measures
• density
• reciprocity
• diameter
Descriptive statistical measures
• variance, standard deviation, coefficient of variation
• degree distribution, powerlaw exponent alpha
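As a concrete example from the degree-based group, the h-index of a graph can be computed from its degree sequence, analogous to the h-index in citation analysis: the largest h such that at least h vertices have degree >= h. A minimal sketch, not the framework's exact implementation:

```python
def h_index(degrees):
    # Largest h such that at least h vertices have degree >= h.
    ds = sorted(degrees, reverse=True)
    h = 0
    for i, d in enumerate(ds, start=1):
        if d >= i:
            h = i
        else:
            break
    return h

print(h_index([3, 3, 2, 1]))  # -> 2
```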
22. Datasets from 9 domains in the LOD Cloud
12 in May 2007
570 in August 2014
1163 in August 2017
1224 in August 2018
1239 in March 2019
A Dataset of Pre-Processed RDF Graphs
23. A Dataset of Pre-Processed RDF Graphs
A total of 280 RDF datasets processed and analyzed
Values for 28 measures per dataset
Graph-objects ready to be re-used, results as CSV, and the original link to the metadata
Case Study with Datasets from LOD Cloud
Available at our website https://data.gesis.org/lodcc/2017-08
24. Graph-based Analysis at Large Scale
To analyze RDF graphs at large scale, you have to
• download the list of available datasets
• acquire the datasets
• represent each as a graph-object
• compute graph measures on it
Sounds easy, right?
25. Graph-based Analysis at Large Scale
In reality it is not that easy:
• not all data providers offer data dumps
• non-standard media type declarations
• various formats, compressed archives, hierarchies of files and folders
• erroneous/error-prone data
26. Acquisition and Preparation
1163 – metadata packages
890 – datasets with URLs, after mapping 150 different media type statements to the supported official ones
486 – after filtering out HTTP 404 responses and content-type HTML
280 – after leaving out SPARQL endpoints, and after graph preparation (corrupt downloads, wrong media type statements, syntax errors)
29. Average degree z seems not to be affected by the number of edges, in all domains but Geography and Government
Average edges per vertex:
Life Sciences: 63.50
Cross Domain: 5.46
Average over all domains: 7.9 edges per vertex
Preliminary Analysis of Results
30. Preliminary Analysis of Results
hd grows exponentially with the number of edges
Life Sciences and Government are more “dense”
Linguistics forms two clusters, with almost no dependency on the number of edges, low on average
31. Availability, Maintenance, Sustainability
The framework
• published under the MIT license on GitHub: https://github.com/mazlo/lodcc
• actively used in other research activities
• future releases (minor, bugfixes, features)
The datasets
• recalculated for newer versions of the LOD Cloud
• made available to the community
• combined with other datasets, e.g. http://stats.lod2.eu
32. Future Work and Research
Investigate domain- and dataset-specific irregularities
Derive implications for modelling tasks, at dataset level and for applications like benchmarking
Offer a SPARQL endpoint to query the results
33. Thank you for your attention
[1] https://github.com/mazlo/lodcc
[2] https://data.gesis.org/lodcc/2017-08
@matzlo
Speaker notes
Our motivation is the study of graph topologies, which is interesting because the availability and linkage of RDF datasets grow. As this number rises we need to collect meaningful statistics and measures to describe the data.
Many approaches collect statistics mostly at instance- and schema-level, but not necessarily from the data structure that an RDF dataset comes with: the RDF data graph.
Various research areas rely on statistics and measures, e.g. data-driven tasks like query processing, studies on the quality of data sets, monitoring services of the evolution of the space, are some examples.
The implicit data structure that we get from a set of RDF triples composes a directed and labelled graph, where subjects and objects can be defined as vertices while predicates correspond to edges. So, when we build a graph-object from this, we will be able to compute various measures.
We can think of various research areas that may benefit from such analyses. For instance,
BENCHMARKING
Benchmark suites e.g. aim at designing a simulation of a real-world scenario, in that they have a synthetic dataset generator and common queries.
If we look closer, we can see that
benchmark datasets interpret growth in terms of the number of edges and the max in-degree, not the max out-degree,
the density of the graph shrinks, and
some have no reciprocity.
SAMPLING
Almost the same applies to sampling methods where here research aims at delivering a representative sample from an original dataset. Example questions that arise in this field: What does representative mean? How to obtain a (minimal) representative sample? Which method to use?
Apart from qualitative aspects, like classes, properties, instances, and used vocabularies, also topological characteristics should be considered, since they allow for a more accurate description of the dataset. This applies to all graph-based datasets, and is not a LD/RDF-specific issue.
PROFILING
With the growing number of datasets in the LOD Cloud, the linkage and connectivity is of particular interest. Graph measures may help to monitor changes and the impact of changes in datasets or even domains over time.
First…
And second, a dataset of 280 RDF datasets from LOD Cloud late 2017, that we processed with the framework. As part of the resource is a website that presents these results for all of the datasets. The datasets are pre-processed and ready to be re-used by you.
In the next slides I am going to present you
how the framework works,
how we did the case study on the LOD Cloud late 2017, and
a report on a preliminary analysis of these results over all domains.
This is how the framework works.
To be able to instantiate a graph-object from an RDF dataset, we have come up with this pipeline. This can also be found in related work and in other studies.
…
The first two steps are optional. First, the corresponding metadata is loaded from datahub, parsed for media types, and saved into a local database. This is advisable for parallel processing, which is highly recommended when you have many datasets. The framework can work with both a database connection and command-line arguments if you have no database.
The list should contain all names, media type statements with URLs, the domain affiliation, and the modification date.
The framework is currently limited to working with official media types for the most common formats of RDF data, which are N-Triples, RDF/XML, Turtle, N-Quads, and Notation3.
Dumps then get downloaded, extracted, transformed, and grouped in the case of archives with multiple files. In order to build a graph-object, one can use an edgelist, which is a list of source and target vertices per line. That is why the N-Triples format is very handy and why we need the transformation procedure.
Here is an example of a statement transformed into N-Triples format.
An issue with N-Triples, however, is that it adds a lot of boilerplate text, because a lot of information gets repeated, mainly the URLs, and so the graph-objects get large, both on hard disk and in memory.
Therefore we used a non-cryptographic hashing function to “encode” each part of the triple.
Encoding data in such a form has many advantages, e.g.
it saves memory, as on average only 20% of the characters have to be stored,
and it makes graphs comparable in terms of contained vertices and edges, because the hash for a URL in dataset 1 will be the same in dataset 2.
To build up the edgelist, we just swapped the positions of the O and the P, making the P an additional property of the edge in the graph-object.
In the last steps, the graph-object is built from the edgelist and the measures are calculated.
The framework can be configured to run in parallel. How long it takes to complete depends on your network connection, CPU cores, and hard disk I/O.
The framework computes 28 graph-based measures which can be grouped into 5 groups. Here are some examples.
BASIC : number of vertices and edges, parallel edges, unique edges.
DEGREE-BASED : max-(in,out)-degree, avg degree, h-index (directed and undirected)
CENTRALITY : graph centralization, max CD
EDGE-BASED : density (ratio all edges to all possible edges), reciprocity, diameter
DESCRIPTIVE : variance, std dev, coefficient of variation, degree-distribution and powerlaw-exponent alpha
This does not necessarily show how well our framework works, but rather how well the underlying libraries that we are using work, e.g. dtrx, rapper, and graph-tool.
Now I will come to the second part, which is the description of datasets that we have processed with the framework.
We thought there is no better stress test for our framework than datasets from the LOD Cloud. From a theoretically available number of around 1200 datasets, we managed to analyze 280 datasets from the LOD Cloud of late 2017; I will tell you why in a moment. This is the second resource that we publish with the paper.
So this resource contains all 280 datasets that we have processed and analyzed with the framework. We got values for all 28 measures per dataset and created a website to be able to browse the results.
For each of the datasets, you can download the initial metadata that was used to acquire the dump, all results as a CSV file export, and a serialized graph-object that you can re-use for further analysis. All of this is available at the website.
The main benefit of this collection is that each RDF dataset is already prepared. This makes it possible to reproduce the results and to perform further analysis of graph measures on the graphs from scratch, without further preparation.
For all datasets we also provide plots, e.g. for the distribution of the degree.
This is how we did the analysis. To analyze RDF graphs at large scale, in terms of dataset size and dataset quantity, you would have to …
But in reality, not all data providers offer data dumps. And for those that offer dumps, you frequently have to deal with non-standard (wrong) media type declarations.
Providers use different formats, some compress their files with different algorithms and some will give you a hierarchy of files and folders including non-RDF data.
In addition, you will have erroneous and error-prone data, like syntax errors etc.
At first we had all metadata packages at hand. After parsing those, we got 150 different media type statements. Since the framework accepts only official media type statements of the most common media types, we mapped them manually.
After this mapping we got 890 datasets with URLs. Further, we filtered out HTTP 404 codes and HTML content types. These were the manual steps in the framework's processing pipeline presented earlier.
Furthermore, we concentrated on data dumps, not SPARQL endpoints, so as not to stress them.
This is a snapshot of the website. On the left side you can see the distribution of the datasets per domain for which we were able to do the analysis. Unfortunately, some of them are not well represented.
However, the largest dataset is in the Cross Domain, which is en-dbpedia with 2.6B edges. Most datasets are in the Linguistics and Publications domains.
We did a preliminary analysis of the measures across all domains and could observe dataset- and domain-specific phenomena. I would like to show you two measures, average degree and h-index on the directed graph, which we have plotted across all domains.
Avg. degree is a frequently consulted measure and gives you the average number of edges that vertices have in the graph object. In this plot you can see the datasets and avg. degree values for 5 domains.
The datasets are ordered descending by number of edges
When you look at the plot you can see that avg. degree seems not to be affected by number of edges, in all but GE and GO. GE and GO report an increasing linear relationship.
Outliers can be observed among all domains, like bio2rdf-irefindex in Life Sciences with 63.
In the Linguistics domain there seem to be two clusters, with one group having higher values than the other. This may be considered a dataset-specific phenomenon, most probably caused by the fact that either
- one data provider used a specific vocabulary and used to model more accurately (more predicates on average), OR
- there were two different providers publishing a lot of small datasets of different kind.
Unfortunately, this may not necessarily be representative, because not all datasets were included.
The second measure that I've plotted here across all domains is the h-index, which is known from citation networks.
There it is an indicator for the importance of a vertex in a network; here it is a statistical measure on the whole graph.
Each dot in the figure is a dataset.
The datasets are ordered descending by number of edges
y-axis is log-scaled
It grows exponentially with the size of the graph. Government and Life Sciences report higher values and could be considered more "dense"; Publications reports lower values. Again, Linguistics shows two clusters with almost constant values that seem to be independent of the number of edges in most cases, in particular for the lower group of datasets.
Regarding future work, we would like to investigate the domain- and dataset-specific irregularities: where do they come from, what is the reason, etc., and derive implications for modelling tasks, at the dataset- and application-specific level, like benchmarking.