Slides from the Workshop on Benchmarking RDF Systems, co-located with the Extended Semantic Web Conference 2013. The presentation covers ongoing work on building a benchmark for electronic publishing applications. The benchmark provides real-world data sets (the Dutch parliamentary proceedings) and a set of analytical SPARQL queries built on top of these data sets. The queries are grouped into micro-benchmarks according to their analytical aims, which makes it possible to analyze the behavior of an RDF store with respect to the particular SPARQL feature exercised by a micro-benchmark or query.
Preliminary results of running the benchmark on the Virtuoso native RDF store are presented, along with references to the online material, including the data sets, the queries and the scripts that were used to obtain the results.
ParlBench: a SPARQL-benchmark for electronic publishing applications.
1. ParlBench: a SPARQL-benchmark for electronic
publishing applications
Tatiana Tarasova Maarten Marx
University of Amsterdam
Information and Language Processing Systems
May 26, 2013
Workshop on Benchmarking RDF Systems, ESWC 2013
5. The ParlBench Benchmark
Goal:
→ test the performance of RDF stores in the setting of e-publishing
applications
Components:
→ real-world data: Dutch parliamentary proceedings, members and
political parties
→ vocabulary: Parliamentary Proceedings [2] (ParliPro) + mix of
existing vocabularies
→ 19 analytical SPARQL queries grouped into 4 micro-benchmarks:
Average, Count, Factual and Top 10
Performance metrics:
→ loading time
→ query response time
10. The ParlBench Data Sets I
PoliticalMashup: characteristics
→ Dutch parliamentary proceedings (1814-2013),
political parties and politicians
→ richly structured XML documents (∼54,000)
→ URIs of concepts
→ metadata: who said what and when
→ links to Wikipedia
Linked PoliticalMashup: design choices
→ keep the URIs and linking structure
→ re-use existing vocabularies
→ link to the Linked Open Data cloud
→ separate the structure from the text
12. The ParlBench Data Sets II
parties: Dutch political parties
members: members of the Dutch parliament
proceedings: structure of the Dutch parliamentary proceedings
paragraphs: content of speeches of the parliamentary meetings
tagged entities: links from the paragraphs to DBpedia
# of triples
parties   members   proceedings   paragraphs   tagged entities   total
510       33,885    ∼36.5M        ∼11.25M      ∼34.4M            ∼82.2M
13. RDF Data Model
Parliamentary Proceedings: ParliPro [2], DC and DC Terms [8]
Diagram (classes): Parliamentary Proceedings, Topic, Stage Direction, Scene, Speech, Paragraph, Parliament Member, Political Party
Diagram (edge labels): has part, references member, references party
14. RDF Data Model
Parliament Member: FOAF [4], Bio [3] and DBpedia Ontology [5]
Diagram (nodes): Parliament Member, DBpedia resource, Biography
Diagram (edge labels): same as, biography
17. RDF Data Model
Tagged Entities: MUTO [6], FOAF [4], Basic WGS84 [7]
Diagram (nodes): Paragraph, Tag, DBpedia resource, Person, Organization, Spatial Thing
Diagram (edge labels): has auto meaning, is a
18. Outline
1 The ParlBench Benchmark
Data Sets
Queries
2 ParlBench experimental run on Virtuoso
19. 19 ParlBench queries: 4 micro-benchmarks
→ 3 Average, e.g.
A0: Retrieve the average number of people who spoke per topic.
→ 5 Count, e.g.
C4: Count speeches of a female speaker from the topic where only one
female spoke.
→ 6 Factual, e.g.
F3: What is the percentage of female speakers?
→ 5 Top 10, e.g.
T4: Retrieve the top 10 longest topics (measured by number of paragraphs).
21. ParlBench experimental run
Test Machine
→ MacBook Pro + Mac OS X Lion 10.7.6 x64
→ CPUs: 2.8 GHz Intel Core i7 (2x2 cores)
→ Memory: 8GB
22. ParlBench experimental run
System Under Test
→ Virtuoso Open Source Edition v.06.01.3, native RDF store
→ default Virtuoso index scheme
→ configuration for large data sets loading
23. ParlBench experimental run
Experimental set-up
→ 8 test collections: Parties, Members, scaled Proceedings (from 1 to
100%)
→ single user mode
→ 1 run = 10 permutations of 19 queries (190 queries)
→ warm-up period: 5 runs (950 queries)
→ measuring period: 3 runs (570 queries)
→ query response time: mean over all permutations of all measured runs
(10*3 = 30 executions per query)
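The set-up above can be sketched as a small driver. This is a minimal illustration and not the authors' scripts: `execute` is a hypothetical callable standing in for whatever sends a query to the store, and random shuffles are assumed for the permutations, since the slides do not say how they were generated.

```python
import random
import time
from statistics import mean

PERMUTATIONS_PER_RUN = 10  # from the slides
WARMUP_RUNS = 5            # from the slides
MEASURED_RUNS = 3          # from the slides


def one_run(query_ids, execute, rng):
    """Execute one run: 10 permutations of all queries.

    Returns a list of (query_id, elapsed_seconds) pairs.
    """
    timings = []
    for _ in range(PERMUTATIONS_PER_RUN):
        order = list(query_ids)
        rng.shuffle(order)  # assumption: permutations are random shuffles
        for qid in order:
            start = time.perf_counter()
            execute(qid)  # hypothetical: sends the query to the store
            timings.append((qid, time.perf_counter() - start))
    return timings


def benchmark(query_ids, execute, seed=0):
    """Warm up, then measure; return the mean response time per query."""
    rng = random.Random(seed)
    for _ in range(WARMUP_RUNS):
        one_run(query_ids, execute, rng)  # warm-up timings are discarded
    samples = {qid: [] for qid in query_ids}
    for _ in range(MEASURED_RUNS):
        for qid, elapsed in one_run(query_ids, execute, rng):
            samples[qid].append(elapsed)
    return {qid: mean(values) for qid, values in samples.items()}
```

With 19 query identifiers this issues 950 warm-up and 570 measured executions, and each reported mean is taken over 30 samples.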
Scaling of proceedings
Scaling factor   1%      2%    4%      8%      16%     32%    64%    100%
# of triples     ∼0.5M   ∼1M   ∼1.9M   ∼3.9M   ∼7.6M   ∼15M   ∼23M   ∼36.5M
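29. T2: Retrieve top 10 topics with the most speeches
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX parlipro: <http://purl.org/vocab/parlipro#>
SELECT ?topic (COUNT(?speech) AS ?numOfSpeeches)
WHERE {
  ?topic rdf:type parlipro:Topic .
  ?speech rdf:type parlipro:Speech .
  # the speech is a direct part of the topic ...
  { ?topic dcterms:hasPart ?speech . }
  UNION
  # ... or part of a stage direction within the topic ...
  { ?topic dcterms:hasPart ?sd .
    ?sd rdf:type parlipro:StageDirection .
    ?sd dcterms:hasPart ?speech . }
  UNION
  # ... or part of a scene within the topic
  { ?topic dcterms:hasPart ?scene .
    ?scene rdf:type parlipro:Scene .
    ?scene dcterms:hasPart ?speech . }
}
GROUP BY ?topic
ORDER BY DESC(?numOfSpeeches)
LIMIT 10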
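One way to time such a query against a SPARQL endpoint is sketched below. It assumes the endpoint follows the standard SPARQL Protocol (the query is passed in the `query` parameter of an HTTP GET). The helper names `build_request` and `timed_query` are illustrative, and the default endpoint is the public one listed on the resources slide, which may no longer be reachable.

```python
import time
import urllib.parse
import urllib.request

ENDPOINT = "http://data.politicalmashup.nl/sparql/"  # from the resources slide


def build_request(query, endpoint=ENDPOINT):
    """Build an HTTP GET request carrying the query, per the SPARQL Protocol."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def timed_query(query, endpoint=ENDPOINT, timeout=60):
    """Send the query and return (elapsed_seconds, raw_response_bytes)."""
    request = build_request(query, endpoint)
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()  # include transfer time in the measurement
    return time.perf_counter() - start, body
```

Measuring on the client side like this includes network and serialization time, so the numbers are only comparable across runs made from the same machine.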
30. Conclusion
→ SPARQL-benchmark for e-publishing applications
→ large collections of real data
→ intuitive analytical queries
→ micro-benchmarks for SPARQL features analysis
Future work
→ enlarge the data sets
- votes in proceedings
- interlink proceedings with the Dutch legislation data set [1] (>280M
triples)
- tagged entities: more tags
→ extend the queries
- SPARQL 1.1: path expressions
- Linked Open Data integration scenario
→ run the benchmark on other RDF stores
31. Thank you!
ParlBench resources
→ data access:
→ resolvable URIs
→ RDF data dumps at http://data.politicalmashup.nl/RDF/data/
→ experimental run:
website describing an experimental run
http://data.politicalmashup.nl/RDF/
public SPARQL-endpoint to a test collection
http://data.politicalmashup.nl/sparql/
→ scripts are available at
http://data.politicalmashup.nl/RDF/scripts/
→ ParliPro vocabulary:
RDF representation http://purl.org/vocab/parlipro#
HTML representation
http://data.politicalmashup.nl/RDF/vocabularies/parlipro
33. References I
[1] Dutch national regulations in CEN MetaLex. http://doc.metalex.eu/
[2] The Parliamentary Proceedings (ParliPro) Vocabulary. http://purl.org/vocab/parlipro#
[3] BIO: A vocabulary for biographical information. http://vocab.org/bio
[4] The Friend of a Friend Vocabulary (FOAF). http://xmlns.com/foaf/0.1/
[5] The DBpedia Ontology. http://dbpedia.org/ontology/
[6] The Modular Unified Tagging Ontology (MUTO). http://muto.socialtagging.org/
[7] Basic Geo (WGS84 lat/long) Vocabulary. http://www.w3.org/2003/01/geo/wgs84_pos#
34. References II
[8] Dublin Core Metadata Element Set, http://purl.org/dc/elements/1.1/, and Dublin Core collection description Terms, http://purl.org/dc/terms/
35. Statistics of the benchmark data sets
dataset           # of triples   size    # of files
members           33,885         14M     3,583
parties           510            612K    151
proceedings       36,503,688     4.15G   51,233
paragraphs        11,250,295     5.77G   51,233
tagged entities   34,449,033     2.57G   34,755
TOTAL             82,237,411     ∼13G    140,955
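The totals in the table can be re-derived from the per-data-set rows. The short check below does so; the dictionary simply restates the figures from the table.

```python
# (triples, files) per data set, copied from the table
DATASETS = {
    "members": (33_885, 3_583),
    "parties": (510, 151),
    "proceedings": (36_503_688, 51_233),
    "paragraphs": (11_250_295, 51_233),
    "tagged entities": (34_449_033, 34_755),
}

total_triples = sum(triples for triples, _ in DATASETS.values())
total_files = sum(files for _, files in DATASETS.values())

# prints "82,237,411 triples in 140,955 files"
print(f"{total_triples:,} triples in {total_files:,} files")
```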
36. Statistics of the ParlBench data sets
Number of classes: 9
Number of properties: 25
Number of instances per class:
Member: 3,583
Party: 151
Proceedings: 51,233
Topic: 102,289
Stage Direction: 1,776,598
Scene: 189,226
Speech: 2,495,969
Paragraph: 11,211,520
Tagged Entity: 11,383,787
38. Members: example of encoding
Diagram: example graph for the member pm:nl.m.02547
Nodes: Parliament Member, nl-dbpedia:Marijke_Vos, en-dbpedia:Marijke_Vos, dbpedia-ont:Female, _:bio, bio:Biography, 1957-05-04, Leidschendam
Edge labels: rdf:type, owl:sameAs, foaf:gender, bio:biography, foaf:birthday, dbpedia-ont:birthPlace
39. Paragraphs and Tagged Entities: example of encoding
Paragraph
pm:nl.proc.ob.d.h-tk-19992000-2432-2483.1.7.30.1 rdf:type parlipro:Paragraph
has text: "Blijkbaar is er nu het een en ander mis in de relatie tussen de Europese Unie en de Russische Federatie. ..."
Tagged Entity
pm:nl.proc.ob.d.h-tk-19992000-2432-2483.1.7.30.1 rdf:type parlipro:Paragraph
pm:nl.proc.ob.d.h-tk-19992000-2432-2483.1.7.30.1 muto:hasTag _:tag
_:tag muto:hasAutoMeaning nl-dbpedia:Rusland
nl-dbpedia:Rusland rdf:type geo:SpatialThing
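The tagged-entity example can be read as four triples. The sketch below restates them as plain Python tuples and follows the path from a paragraph through its tag to the DBpedia resource. Edge directions are inferred from the flattened diagram, prefixed names are kept as plain strings, and the helpers `objects` and `tagged_resources` are illustrative only.

```python
PARAGRAPH = "pm:nl.proc.ob.d.h-tk-19992000-2432-2483.1.7.30.1"

# Triples read off the example; "_:tag" is a blank node.
# Assumption: edge directions as inferred from the slide.
TRIPLES = [
    (PARAGRAPH, "rdf:type", "parlipro:Paragraph"),
    (PARAGRAPH, "muto:hasTag", "_:tag"),
    ("_:tag", "muto:hasAutoMeaning", "nl-dbpedia:Rusland"),
    ("nl-dbpedia:Rusland", "rdf:type", "geo:SpatialThing"),
]


def objects(subject, predicate, triples=TRIPLES):
    """Return all objects of triples matching the subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def tagged_resources(paragraph, triples=TRIPLES):
    """Return (resource, types) for every entity tagged in the paragraph."""
    result = []
    for tag in objects(paragraph, "muto:hasTag", triples):
        for resource in objects(tag, "muto:hasAutoMeaning", triples):
            result.append((resource, objects(resource, "rdf:type", triples)))
    return result
```

Keeping the tag as an intermediate node, rather than linking the paragraph straight to the DBpedia resource, leaves room for several tags per paragraph.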