ParlBench: a SPARQL-benchmark for electronic publishing applications.

Tatiana Tarasova Maarten Marx
University of Amsterdam
Information and Language Processing Systems
May 26, 2013
Workshop on Benchmarking RDF Systems, ESWC 2013

[Figure: domains of the Linked Open Data cloud: Media, Publications, Life Sciences, Cross-Domain, Geographic, Government]

The ParlBench Benchmark
Goal:
→ test the performance of RDF stores in the setting of e-publishing
applications
Components:
→ real-world data: Dutch parliamentary proceedings, members and
political parties
→ vocabulary: Parliamentary Proceedings [2] (ParliPro) + a mix of
existing vocabularies
→ 19 analytical SPARQL queries grouped into 4 micro-benchmarks:
Average, Count, Factual and Top 10
Performance metrics:
→ loading time
→ query response time

Outline
1 The ParlBench Benchmark
Data Sets
Queries
2 ParlBench experimental run on Virtuoso

The ParlBench Data Sets I
PoliticalMashup: characteristics
→ Dutch parliamentary proceedings (1814-2013),
political parties and politicians
→ richly structured XML documents (∼54,000)
→ URIs of concepts
→ metadata: who said what and when
→ links to Wikipedia
Linked PoliticalMashup: design choices
→ keep the URIs and linking structure
→ re-use existing vocabularies
→ link to the Linked Open Data cloud
→ separate the structure from the text

The ParlBench Data Sets II
parties: Dutch political parties
members: members of the Dutch parliament
proceedings: structure of the Dutch parliamentary proceedings
paragraphs: content of speeches of the parliamentary meetings
tagged entities: links from the paragraphs to DBpedia
# of triples per data set
data set           # of triples
parties            510
members            33,885
proceedings        ∼36.5M
paragraphs         ∼11.25M
tagged entities    ∼34.4M
total              ∼82.2M

RDF Data Model
Parliamentary Proceedings: ParliPro [2], DC and DC Terms [8]
[Diagram: classes Parliamentary Proceedings, Topic, Stage Direction, Scene, Speech, Paragraph, Parliament Member and Political Party. The structural elements are nested through "has part"; "references member" and "references party" point to Parliament Member and Political Party.]

RDF Data Model
Parliament Member: FOAF [4], Bio [3] and DBpedia Ontology [5]
[Diagram: Parliament Member --same as--> DBpedia resource; Parliament Member --biography--> Biography]

RDF Data Model
Parties: ParliPro [2]
[Diagram: Political Party --same as--> DBpedia resource]

RDF Data Model
Paragraphs: ParliPro [2]
[Diagram: Paragraph --has text--> Content of the paragraph]

RDF Data Model
Tagged Entities: MUTO [6], FOAF [4], Basic WGS84 [7]
[Diagram: a Paragraph carries a Tag; Tag --has auto meaning--> DBpedia resource; DBpedia resource --is a--> Person, Organization or Spatial Thing]

19 ParlBench queries: 4 micro-benchmarks
→ 3 Average, e.g.
A0: Retrieve the average number of people who spoke per topic.
→ 5 Count, e.g.
C4: Count the speeches of the female speaker in topics where only one
woman spoke.
→ 6 Factual, e.g.
F3: What is the percentage of female speakers?
→ 5 Top 10, e.g.
T4: Retrieve the top 10 longest topics (length = number of paragraphs).

ParlBench experimental run
Test Machine
→ MacBook Pro + Mac OS X Lion 10.7.6 x64
→ CPUs: 2.8 GHz Intel Core i7 (2x2 cores)
→ Memory: 8GB

System Under Test
→ Virtuoso Open Source Edition v.06.01.3, native RDF store
→ default Virtuoso index scheme
→ configuration for loading large data sets

Experimental set-up
→ 8 test collections: Parties, Members, scaled Proceedings (from 1 to
100%)
→ single user mode
→ 1 run = 10 permutations of 19 queries (190 queries)
→ warm-up period: 5 runs (950 queries)
→ measuring period: 3 runs (570 queries)
→ query response time: mean over all permutations of all measured runs
(10 × 3 = 30 executions per query)
Scaling of proceedings
scaling factor    # of triples
1%                ∼0.5M
2%                ∼1M
4%                ∼1.9M
8%                ∼3.9M
16%               ∼7.6M
32%               ∼15M
64%               ∼23M
100%              ∼36.5M
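
The set-up above can be sketched as a small driver. This is a minimal sketch under stated assumptions, not the authors' scripts: `execute` is a hypothetical callable that sends one query to the store, and shuffling the query order at random for each permutation is an assumption (the slides only say that one run consists of 10 permutations of the 19 queries).

```python
import random
import statistics
import time

# The 19 ParlBench query identifiers, grouped by micro-benchmark.
QUERY_IDS = (
    ["A0", "A1", "A2"]
    + ["C0", "C1", "C2", "C3", "C4"]
    + ["F0", "F1", "F2", "F3", "F4", "F5"]
    + ["T0", "T1", "T2", "T3", "T4"]
)

PERMUTATIONS_PER_RUN = 10  # 1 run = 10 permutations of 19 queries (190 queries)
WARMUP_RUNS = 5            # 950 queries, timings discarded
MEASURED_RUNS = 3          # 570 queries, timings kept


def run_once(execute, rng, timings=None):
    """Execute one run; record elapsed seconds per query if timings is given."""
    for _ in range(PERMUTATIONS_PER_RUN):
        order = list(QUERY_IDS)
        rng.shuffle(order)  # assumption: each permutation is a random shuffle
        for qid in order:
            start = time.perf_counter()
            execute(qid)  # hypothetical: sends query `qid` to the RDF store
            elapsed = time.perf_counter() - start
            if timings is not None:
                timings[qid].append(elapsed)


def collect_timings(execute, seed=0):
    """Warm up for 5 runs, then measure 3 runs; return raw timings per query."""
    rng = random.Random(seed)
    for _ in range(WARMUP_RUNS):
        run_once(execute, rng)
    timings = {qid: [] for qid in QUERY_IDS}
    for _ in range(MEASURED_RUNS):
        run_once(execute, rng, timings)
    return timings


def mean_response_times(timings):
    """Mean response time per query over its 10 * 3 = 30 measured executions."""
    return {qid: statistics.mean(samples) for qid, samples in timings.items()}
```

With a real `execute`, `mean_response_times(collect_timings(execute))` gives one mean per query; summing those means per micro-benchmark gives the aggregated figures reported per collection size.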

Loading Time (log2 scale)
[Plot: loading time in seconds (log2 axis, 1 to 4096) against size of proceedings in % (1, 2, 4, 8, 16, 32, 64, 100)]

Query Response Time by Micro-Benchmarks (log2 scale)
[Plot: sum of execution time in seconds (log2 axis, 0.25 to 256) against size of proceedings in % (1, 2, 4, 8, 16, 32, 64, 100); one series per micro-benchmark: average, count, factual, top10]

Query Response Time on the Largest Collection (∼36M)
[Bar chart: response time per query; values as labelled on the bars, in x-axis order]
query    micro-benchmark    time, sec
A0       average            45.9422
A1       average            39.5885
A2       average            47.1268
C0       count              2.4212
C1       count              10.6883
C2       count              1.4383
C3       count              0.8649
C4       count              30.0118
F0       factual            7.9996
F1       factual            78.1858
F2       factual            22.3778
F3       factual            22.4192
F4       factual            0.1053
F5       factual            48.8887
T0       top10              0.8357
T1       top10              10.2813
T2       top10              41.6915
T3       top10              0.9241
T4       top10              168.1313

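T4: Retrieve the top 10 longest topics (length = number of
paragraphs).
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX parlipro: <http://purl.org/vocab/parlipro#>
SELECT ?topic (COUNT(?par) AS ?numOfPars)
WHERE {
?topic rdf:type parlipro:Topic .
?speech rdf:type parlipro:Speech .
?speech dcterms:hasPart ?par .
?par rdf:type parlipro:Paragraph .
# a speech is either a direct part of the topic ...
{?topic dcterms:hasPart ?speech .}
UNION{
# ... or part of a stage direction within the topic ...
?topic dcterms:hasPart ?sd .
?sd rdf:type parlipro:StageDirection .
?sd dcterms:hasPart ?speech .}
UNION{
# ... or part of a scene within the topic
?topic dcterms:hasPart ?scene .
?scene rdf:type parlipro:Scene .
?scene dcterms:hasPart ?speech .}}
GROUP BY ?topic
ORDER BY DESC(?numOfPars)
LIMIT 10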
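
A query such as T4 can be sent to a SPARQL endpoint over plain HTTP. The following is a minimal sketch using only the Python standard library. It assumes the endpoint follows the standard SPARQL protocol (the query is passed as a `query` parameter of a GET request) and returns JSON results when asked for `application/sparql-results+json`. The endpoint URL is the public one listed under the ParlBench resources; whether it is still reachable is not guaranteed. The helper names `build_request` and `run_query` are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Public endpoint named in the ParlBench resources (availability is an assumption).
ENDPOINT = "http://data.politicalmashup.nl/sparql/"

T4_QUERY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX parlipro: <http://purl.org/vocab/parlipro#>
SELECT ?topic (COUNT(?par) AS ?numOfPars)
WHERE {
  ?topic rdf:type parlipro:Topic .
  ?speech rdf:type parlipro:Speech .
  ?speech dcterms:hasPart ?par .
  ?par rdf:type parlipro:Paragraph .
  { ?topic dcterms:hasPart ?speech . }
  UNION {
    ?topic dcterms:hasPart ?sd .
    ?sd rdf:type parlipro:StageDirection .
    ?sd dcterms:hasPart ?speech . }
  UNION {
    ?topic dcterms:hasPart ?scene .
    ?scene rdf:type parlipro:Scene .
    ?scene dcterms:hasPart ?speech . }
}
GROUP BY ?topic
ORDER BY DESC(?numOfPars)
LIMIT 10
"""


def build_request(query, endpoint=ENDPOINT):
    """Build a SPARQL-protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def run_query(query, endpoint=ENDPOINT, timeout=300):
    """Send the query and return the list of result bindings (needs network)."""
    with urlopen(build_request(query, endpoint), timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]
```

Calling `run_query(T4_QUERY)` would return one binding per topic, each with `topic` and `numOfPars` entries. The generous timeout reflects that T4 is the slowest query on the full collection.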

Characteristics of ParlBench queries
SPARQL feature        # of queries using it (out of 19)
FILTER                8
UNION                 9
LIMIT                 7
ORDER BY              7
GROUP BY              12
COUNT                 17
DISTINCT              4
AVG                   3
negation              1
OPTIONAL              2
subquery              7
blank node scoping    9

# of triple patterns per query
Average:  A0 10, A1 9, A2 12
Count:    C0 5, C1 5, C2 5, C3 6, C4 13
Factual:  F0 8, F1 16, F2 6, F3 6, F4 2, F5 4
Top 10:   T0 2, T1 4, T2 9, T3 3, T4 11

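T2: Retrieve the top 10 topics with the most speeches.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX parlipro: <http://purl.org/vocab/parlipro#>
SELECT ?topic (COUNT(?speech) AS ?numOfSpeeches)
WHERE {
?topic rdf:type parlipro:Topic .
?speech rdf:type parlipro:Speech .
# a speech is a direct part of the topic, or part of a stage
# direction or scene within the topic
{?topic dcterms:hasPart ?speech .}
UNION{
?topic dcterms:hasPart ?sd .
?sd rdf:type parlipro:StageDirection .
?sd dcterms:hasPart ?speech .}
UNION{
?topic dcterms:hasPart ?scene .
?scene rdf:type parlipro:Scene .
?scene dcterms:hasPart ?speech .}}
GROUP BY ?topic
ORDER BY DESC(?numOfSpeeches)
LIMIT 10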

Conclusion
→ SPARQL-benchmark for e-publishing applications
→ large collections of real data
→ intuitive analytical queries
→ micro-benchmarks for SPARQL features analysis
Future work
→ enlarge the data sets
- votes in proceedings
- interlink proceedings with the Dutch legislation data set [1] (>280M
triples)
- tagged entities: more tags
→ extend the queries
- SPARQL 1.1: path expressions
- Linked Open Data integration scenario
→ run the benchmark on other RDF stores

Thank you!
ParlBench resources
→ data access:
→ resolvable URIs
→ RDF data dumps at http://data.politicalmashup.nl/RDF/data/
→ experimental run:
website describing an experimental run
http://data.politicalmashup.nl/RDF/
public SPARQL-endpoint to a test collection
http://data.politicalmashup.nl/sparql/
→ scripts are available at
http://data.politicalmashup.nl/RDF/scripts/
→ ParliPro vocabulary:
RDF representation http://purl.org/vocab/parlipro#
HTML representation
http://data.politicalmashup.nl/RDF/vocabularies/parlipro

References I
[1] Dutch national regulations in CEN MetaLex
http://doc.metalex.eu/
[2] The Parliamentary Proceedings (ParliPro) Vocabulary
http://purl.org/vocab/parlipro#
[3] BIO: A vocabulary for biographical information
http://vocab.org/bio
[4] The Friend of a Friend Vocabulary (FOAF)
http://xmlns.com/foaf/0.1/
[5] The DBpedia Ontology
http://dbpedia.org/ontology/
[6] The Modular Unified Tagging Ontology (MUTO)
http://muto.socialtagging.org/
[7] Basic Geo (WGS84 lat/long) Vocabulary
http://www.w3.org/2003/01/geo/wgs84_pos#

References II
[8] Dublin Core Metadata Element Set
http://purl.org/dc/elements/1.1/ and Dublin Core collection
description Terms http://purl.org/dc/terms/

Statistics of the benchmark data sets
data set           # of triples    size     # of files
members            33,885          14M      3,583
parties            510             612K     151
proceedings        36,503,688      4.15G    51,233
paragraphs         11,250,295      5.77G    51,233
tagged entities    34,449,033      2.57G    34,755
TOTAL              82,237,411      ∼13G     140,955
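
The totals in this table can be cross-checked from the per-data-set rows. The short sketch below only restates the figures above; the dictionary layout and variable names are illustrative.

```python
# (triples, files) per data set, copied from the statistics table.
DATASETS = {
    "members": (33_885, 3_583),
    "parties": (510, 151),
    "proceedings": (36_503_688, 51_233),
    "paragraphs": (11_250_295, 51_233),
    "tagged entities": (34_449_033, 34_755),
}

total_triples = sum(triples for triples, _ in DATASETS.values())
total_files = sum(files for _, files in DATASETS.values())

# Share of each data set in the total number of triples, in percent.
triple_share = {
    name: 100.0 * triples / total_triples
    for name, (triples, _) in DATASETS.items()
}

print(total_triples, total_files)
```

Both sums agree with the TOTAL row (82,237,411 triples in 140,955 files).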

Statistics of the ParlBench data sets
Number of classes: 9
Number of properties: 25
Number of instances per class:
Member: 3,583
Party: 151
Proceedings: 51,233
Topic: 102,289
Stage Direction: 1,776,598
Scene: 189,226
Speech: 2,495,969
Paragraph: 11,211,520
Tagged Entity: 11,383,787

Parliamentary Proceedings: example of encoding
[Diagram: one proceedings document and its nested parts, each linked to the next by dcterms:hasPart]
pm:nl.proc.ob.d.h-tk-19992000-2432-2483 rdf:type parlipro:ParliamentaryProceedings
pm:nl.proc.ob.d.h-tk-19992000-2432-2483.1 rdf:type parlipro:Topic
pm:nl.proc.ob.d.h-tk-19992000-2432-2483.1.7 rdf:type parlipro:Scene
pm:nl.proc.ob.d.h-tk-19992000-2432-2483.1.7.30 rdf:type parlipro:Speech
pm:nl.proc.ob.d.h-tk-19992000-2432-2483.1.7.30.1 rdf:type parlipro:Paragraph
Other properties shown: dc:date (1999-12-08), parlipro:refMember (pm:nl.m.02547), parlipro:refParty (pm:nl.p.gl)
…

Members: example of encoding
[Diagram: resources and properties around the member pm:nl.m.02547]
rdf:type: Parliament Member
owl:sameAs: nl-dbpedia:Marijke_Vos, en-dbpedia:Marijke_Vos
foaf:gender: dbpedia-ont:Female
foaf:birthday: 1957-05-04
dbpedia-ont:birthPlace: Leidschendam
bio:biography: _:bio (rdf:type bio:Biography)

Paragraphs and Tagged Entities: example of encoding
Paragraph
pm:nl.proc.ob.d.h-tk-19992000-2432-2483.1.7.30.1 rdf:type parlipro:Paragraph
pm:nl.proc.ob.d.h-tk-19992000-2432-2483.1.7.30.1 has text "Blijkbaar is er nu het een en ander mis in de relatie tussen de Europese Unie en de Russische Federatie. ..."
(English: "Apparently something is now amiss in the relationship between the European Union and the Russian Federation. ...")
Tagged Entity
pm:nl.proc.ob.d.h-tk-19992000-2432-2483.1.7.30.1 muto:hasTag _:tag
_:tag muto:hasAutoMeaning nl-dbpedia:Rusland
nl-dbpedia:Rusland rdf:type geo:SpatialThing
