A Main Memory Index Structure to Query Linked Data
1. A Main Memory Index Structure to Query Linked Data
Olaf Hartig
http://olafhartig.de/foaf.rdf#olaf
Frank Huber
Database and Information Systems Research Group
Humboldt-Universität zu Berlin
2. The Issue
[Figure: hit rate, number of query results, and query execution time (in seconds) for ContactInfoPhillipe (Query No. 36), UnsetPropsPhillipe (Query No. 37), 2ndDegree1Phillipe (Query No. 38), 2ndDegree2Phillipe (Query No. 39), and IncomingPhillipe (Query No. 40). Series: no reuse, given order.]
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 2
3. The Issue
[Figure: hit rate, number of query results, and query execution time (in seconds) for Queries No. 36 to 40 (series: no reuse, given order), annotated with the number of descriptor objects in the query-local dataset after query execution: 172 and 533.]
4. Query-Local Dataset
● Logical representation of Linked Data from the Web: the query-local dataset
● Physical representation of Linked Data from the Web: ?
● What data structure do we use to physically represent the query-local dataset?
5. Outline
1. Requirements + Existing Work
2. Data Structures
3. Evaluation
6. Requirements
● (Consecutively) build and use ad hoc collections of
many small sets of RDF triples
● Four main operations:
● Find … matching triples for a triple pattern in all
descriptor objects
● Add, Remove, Replace … descriptor objects
● Support of concurrent access (i.e. isolation)
● Non-relevant properties:
● Querying descriptor objects individually is not necessary
● No need to write data back to the Web
● ACID properties not required for complete queries
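The four operations above can be sketched as a small interface. The following is only an unindexed baseline for illustration, not the structure proposed in these slides: the class name, the method names, and the choice to identify each descriptor object by a URL string are assumptions, and concurrent access is not handled.

```python
from typing import Dict, Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]
# A triple pattern; None stands for a variable (hypothetical convention).
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


class NaiveQueryLocalDataset:
    """Unindexed baseline: each descriptor object is a plain set of triples."""

    def __init__(self) -> None:
        self._objects: Dict[str, Set[Triple]] = {}

    def add(self, url: str, triples: Set[Triple]) -> None:
        self._objects[url] = set(triples)

    def remove(self, url: str) -> None:
        self._objects.pop(url, None)

    def replace(self, url: str, triples: Set[Triple]) -> None:
        self._objects[url] = set(triples)

    def find(self, pattern: Pattern) -> Iterator[Triple]:
        """Yield each matching triple once, across all descriptor objects."""
        seen: Set[Triple] = set()
        for triples in self._objects.values():
            for t in triples:
                matches = all(p is None or p == v for p, v in zip(pattern, t))
                if matches and t not in seen:
                    seen.add(t)
                    yield t
```

Find scans every triple here, which is the cost the index structures on the following slides are meant to avoid.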
8. Existing Work
● Disk based storage solutions for RDF data
● Unsuitable due to very costly I/O operations
● Main memory based data structures in the literature
● Focus on a large, single set of RDF triples
● Optimized for complete graph pattern queries or path queries
● Main memory based data structures in RDF frameworks
● Focus on Jena, ARQ and NG4J
● Inefficient (see evaluation)
10. Hash-Based Index for RDF Data
[Figure: logical representation (RDF graph) and physical representation (Dict plus hash tables S, P, O, SP, PO, SO)]
● Dictionary:
● Two-way mapping between RDF terms and numerical identifiers
● 6 hash tables:
● Each hash table contains all ID-encoded triples t_id = ( id_s, id_p, id_o )
● Efficient support for all types of triple patterns
● Example: Find for the triple pattern ( ?acq, knows, http://bob.name )
*Similar to Harth and Decker, 2005
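A minimal sketch of the dictionary and the six hash tables may help make the lookup concrete. It uses Python dicts as the hash tables; the class names, method names, and the choice of the SP table for fully bound patterns are illustrative assumptions, not the SQUIN implementation.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

IdTriple = Tuple[int, int, int]

# Triple positions (0 = subject, 1 = predicate, 2 = object) that key each table.
KEY_POSITIONS = {"S": (0,), "P": (1,), "O": (2,),
                 "SP": (0, 1), "PO": (1, 2), "SO": (0, 2)}


class Dictionary:
    """Two-way mapping between RDF terms and numerical identifiers."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._terms: List[str] = []

    def encode(self, term: str) -> int:
        """Return the ID of a term, assigning a fresh one if needed."""
        if term not in self._ids:
            self._ids[term] = len(self._terms)
            self._terms.append(term)
        return self._ids[term]

    def lookup(self, term: str) -> Optional[int]:
        """Return the ID of a known term, or None without assigning one."""
        return self._ids.get(term)

    def decode(self, term_id: int) -> str:
        return self._terms[term_id]


class HashIndex:
    """Six hash tables, each holding every ID-encoded triple."""

    def __init__(self) -> None:
        self.dictionary = Dictionary()
        self.tables = {name: defaultdict(set) for name in KEY_POSITIONS}

    def add(self, s: str, p: str, o: str) -> None:
        t: IdTriple = tuple(self.dictionary.encode(x) for x in (s, p, o))
        for name, positions in KEY_POSITIONS.items():
            self.tables[name][tuple(t[i] for i in positions)].add(t)

    def find(self, s=None, p=None, o=None):
        """Return decoded triples matching the pattern; None marks a variable."""
        terms = (s, p, o)
        ids = [None if x is None else self.dictionary.lookup(x) for x in terms]
        if any(x is not None and i is None for x, i in zip(terms, ids)):
            return []  # a bound term that was never indexed cannot match
        bound = [i for i in range(3) if ids[i] is not None]
        if not bound:  # all variables: every triple is reachable via the S table
            candidates = [t for bucket in self.tables["S"].values() for t in bucket]
        else:
            # Probe the table keyed on the bound positions; for a fully bound
            # pattern probe SP and let the filter below check the object.
            positions = tuple(bound[:2]) if len(bound) == 3 else tuple(bound)
            name = "".join("SPO"[i] for i in positions)
            key = tuple(ids[i] for i in positions)
            candidates = self.tables[name].get(key, ())
        matches = [t for t in candidates
                   if all(ids[i] is None or ids[i] == t[i] for i in range(3))]
        return sorted(tuple(self.dictionary.decode(i) for i in t) for t in matches)
```

With this sketch, the pattern ( ?acq, knows, http://bob.name ) becomes `find(p="knows", o="http://bob.name")` and is answered by a single probe of the PO table.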
13. Individual Indexing
[Figure: query-local dataset; logical representation and physical representation with one Dict and a separate set of hash tables (S, P, O, SP, PO, SO) per descriptor object]
● Idea: Index each descriptor object separately
● Implementation of the four operations:
● Add, Remove, and Replace are straightforward
● Find requires iterating over all indexes
● Example: Find for the triple pattern ( ?acq, knows, http://bob.name ) probes every index
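Individual indexing can be sketched as a map from descriptor object to its own index. This is an illustration under assumptions: all names are hypothetical, descriptor objects are identified by URL strings, dictionary encoding is omitted for brevity, and each index also keeps a plain set of its triples for the all-variable and fully bound cases.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]  # None = variable

# Key positions of the six tables: S, P, O, SP, SO, PO.
KEYS = [c for n in (1, 2) for c in combinations(range(3), n)]


class TripleIndex:
    """Six hash tables over the triples of one descriptor object."""

    def __init__(self, triples: Iterable[Triple]) -> None:
        self.triples = set(triples)
        self.tables = {k: defaultdict(set) for k in KEYS}
        for t in self.triples:
            for k in KEYS:
                self.tables[k][tuple(t[i] for i in k)].add(t)

    def find(self, pattern: Pattern) -> Set[Triple]:
        bound = tuple(i for i, v in enumerate(pattern) if v is not None)
        if not bound:
            return set(self.triples)
        if len(bound) == 3:
            return {pattern} & self.triples
        key = tuple(pattern[i] for i in bound)
        return set(self.tables[bound].get(key, ()))


class IndividuallyIndexedDataset:
    """One TripleIndex per descriptor object."""

    def __init__(self) -> None:
        self.indexes: Dict[str, TripleIndex] = {}

    def add(self, url: str, triples: Iterable[Triple]) -> None:
        self.indexes[url] = TripleIndex(triples)

    def remove(self, url: str) -> None:
        self.indexes.pop(url, None)

    def replace(self, url: str, triples: Iterable[Triple]) -> None:
        self.indexes[url] = TripleIndex(triples)  # swap in a fresh index

    def find(self, pattern: Pattern) -> Set[Triple]:
        result: Set[Triple] = set()
        for index in self.indexes.values():  # one probe per descriptor object
            result |= index.find(pattern)
        return result
```

Add, Remove, and Replace touch only one entry of the map, while the cost of Find grows with the number of descriptor objects.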
15. Combined Indexing
[Figure: query-local dataset; logical representation and physical representation with one Dict and a single set of hash tables (S, P, O, SP, PO, SO)]
● Idea: Use a single index for all descriptor objects
● src – maps each triple to a set of descriptor object IDs
● Example: for an ID-encoded triple t_id = ( id_s, id_p, id_o ) that occurs in two descriptor objects, src( t_id ) contains the IDs of both
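A combined index with the src mapping might look like the following sketch. The names are hypothetical, dictionary encoding is omitted, and Remove finds the affected triples by a linear scan over src, which keeps the example short but is an assumption rather than the approach of the paper.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]  # None = variable

# Key positions of the six tables: S, P, O, SP, SO, PO.
KEYS = [c for n in (1, 2) for c in combinations(range(3), n)]


class CombinedIndex:
    """One set of hash tables shared by all descriptor objects."""

    def __init__(self) -> None:
        self.tables = {k: defaultdict(set) for k in KEYS}
        # src maps each indexed triple to the descriptor objects containing it.
        self.src: Dict[Triple, Set[str]] = {}

    def add(self, d: str, triples: Iterable[Triple]) -> None:
        for t in triples:
            if t not in self.src:  # first occurrence: index the triple itself
                self.src[t] = set()
                for k in KEYS:
                    self.tables[k][tuple(t[i] for i in k)].add(t)
            self.src[t].add(d)

    def remove(self, d: str) -> None:
        for t in [t for t, owners in self.src.items() if d in owners]:
            self.src[t].discard(d)
            if not self.src[t]:  # no descriptor object contains t any more
                del self.src[t]
                for k in KEYS:
                    key = tuple(t[i] for i in k)
                    self.tables[k][key].discard(t)
                    if not self.tables[k][key]:
                        del self.tables[k][key]

    def replace(self, d: str, triples: Iterable[Triple]) -> None:
        self.remove(d)
        self.add(d, triples)

    def find(self, pattern: Pattern) -> Set[Triple]:
        bound = tuple(i for i, v in enumerate(pattern) if v is not None)
        if not bound:
            return set(self.src)
        if len(bound) == 3:
            return {pattern} if pattern in self.src else set()
        key = tuple(pattern[i] for i in bound)
        return set(self.tables[bound].get(key, ()))
```

Find now needs a single probe regardless of the number of descriptor objects, at the price of maintaining src on every Add and Remove.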
18. Quad Indexing
[Figure: query-local dataset; logical representation and physical representation with one Dict and a single set of hash tables (S, P, O, SP, PO, SO)]
● Idea: Use a single quad index for all descriptor objects
● quad = ID-encoded triple + descriptor object ID
● q = ( ( id_s, id_p, id_o ), descriptor object ID )
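The quad variant can be sketched by storing (triple, descriptor object ID) pairs directly in the hash tables, so that no separate src map is needed. As before, the names are hypothetical, dictionary encoding is omitted, and Remove scans all buckets for brevity.

```python
from collections import defaultdict
from itertools import combinations
from typing import Iterable, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Quad = Tuple[Triple, str]  # (triple, descriptor object ID)
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]  # None = variable

# Key positions of the six tables: S, P, O, SP, SO, PO.
KEYS = [c for n in (1, 2) for c in combinations(range(3), n)]


class QuadIndex:
    """Hash tables keyed on triple components, holding quads as entries."""

    def __init__(self) -> None:
        self.tables = {k: defaultdict(set) for k in KEYS}

    def add(self, d: str, triples: Iterable[Triple]) -> None:
        for t in triples:
            q: Quad = (t, d)
            for k in KEYS:  # no lookup of existing entries is needed
                self.tables[k][tuple(t[i] for i in k)].add(q)

    def remove(self, d: str) -> None:
        for table in self.tables.values():
            for key in list(table):
                table[key] = {q for q in table[key] if q[1] != d}
                if not table[key]:
                    del table[key]

    def replace(self, d: str, triples: Iterable[Triple]) -> None:
        self.remove(d)
        self.add(d, triples)

    def find(self, pattern: Pattern) -> Set[Triple]:
        bound = tuple(i for i, v in enumerate(pattern) if v is not None)
        if not bound:
            quads = [q for bucket in self.tables[(0,)].values() for q in bucket]
        elif len(bound) == 3:
            bucket = self.tables[(0, 1)].get((pattern[0], pattern[1]), ())
            quads = [q for q in bucket if q[0][2] == pattern[2]]
        else:
            quads = self.tables[bound].get(tuple(pattern[i] for i in bound), ())
        # Several quads may carry the same triple; report each triple once.
        return {t for t, _ in quads}
```

In this sketch Add never has to check whether a triple is already indexed, while Find has to collapse quads that differ only in the descriptor object ID.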
20. Experiment Setup
● Question: Does the choice of data structure affect the overall execution time of link traversal based query execution?
● Simulation of the Web of Data
● Linked Data server publishes BSBM dataset (scaling factor: 50)
● Adjusted BSBM queries link to the simulation server
● Experiment:
● Sequence of 200 query mixes
● Reuse of the query-local dataset for the whole sequence
● IndIR, CombIR, and QuadIR (as presented), engine: SQUIN
● NamedGraphSetImpl (NG4J/Jena), engine: SemWeb Client
22. Execution Time
[Figure: overall number of descriptor objects in the queried dataset and execution time in seconds, each plotted over the query mixes (0 to 200). Series: NG4J (SWClLib); IndIR, m=4; CombIR, m=12; CombQuadIR, m=12.]
24. Summary
● Three hash index based data structures:
● Individual indexing
● Combined indexing
● Quad indexing
● Findings:
● A single index improves query performance significantly
● Shorter load times with quads
● Also applicable to other use cases that store Linked Data ad hoc, where the data is:
● Consecutively retrieved from remote sources
● Used for immediate local processing
25. Backup Slides
26. Combined Indexing
[Figure: query-local dataset; logical representation and physical representation with one Dict and a single set of hash tables (S, P, O, SP, PO, SO)]
● Idea: Use a single index for all descriptor objects
● src – maps each triple to a set of descriptor object IDs
● Implementation of the four operations requires:
● status – maps each descriptor object ID to a status
( BeingIndexed, Indexed, ToBeRemoved, BeingRemoved )
● Find reports a triple t only if: ∃d ∈ src(t) : status(d) = Indexed
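The visibility rule for Find can be written down directly. In this sketch the enum and function names are hypothetical; only the four status values and the condition ∃d ∈ src(t) : status(d) = Indexed come from the slide.

```python
from enum import Enum
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]


class Status(Enum):
    BEING_INDEXED = "BeingIndexed"
    INDEXED = "Indexed"
    TO_BE_REMOVED = "ToBeRemoved"
    BEING_REMOVED = "BeingRemoved"


def is_reported(t: Triple,
                src: Dict[Triple, Set[str]],
                status: Dict[str, Status]) -> bool:
    """True if at least one descriptor object in src(t) has status Indexed."""
    return any(status.get(d) is Status.INDEXED for d in src.get(t, ()))
```

A triple that occurs only in descriptor objects that are still being indexed, or that are scheduled for removal, is therefore hidden from concurrent Find operations.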
27. Implementation
● Available as Free Software (http://squin.org)
● Hash tables with n = 2^m buckets
● Hash functions:
● h_S( i_s, i_p, i_o ) = i_s & bitmask[m]
● h_SP( i_s, i_p, i_o ) = ( i_s • i_p ) & bitmask[m]
● etc.
● Which m? (see paper)
● m = 4 for the individual indexes
● m = 12 for the combined index
● m = 12 for the quad index
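The bucket computation can be sketched as follows. Two points are assumptions: bitmask is written as a function returning the m lowest bits set (the slide indexes an array, bitmask[m]), and the "•" operator in h_SP is read as integer multiplication.

```python
def bitmask(m: int) -> int:
    """Mask with the m lowest bits set; selects one of n = 2**m buckets."""
    return (1 << m) - 1


def h_s(i_s: int, i_p: int, i_o: int, m: int) -> int:
    """Bucket for the S table: the m lowest bits of the subject ID."""
    return i_s & bitmask(m)


def h_sp(i_s: int, i_p: int, i_o: int, m: int) -> int:
    """Bucket for the SP table; the product is an assumed reading of '•'."""
    return (i_s * i_p) & bitmask(m)
```

Masking instead of taking a remainder works because the number of buckets is a power of two; with m = 12 every hash value falls into one of 4096 buckets.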
28. Experiment Setup
● Comparison without link traversal based query execution
● Compared data structures:
● Our implementation of IndIR, CombIR, CombQuadIR
● NamedGraphSetImpl in NG4J (Jena)
● DatasetGraphMap in ARQ (Jena)
● Berlin SPARQL Benchmark (BSBM)
● BSBM datasets partitioned into query-local datasets
● BSBM (v2.0) query mixes executed over these datasets
29. Required Memory
[Figure: estimated required memory in MB over pc (BSBM scaling factor, 0 to 500). Series: ARQ; NG4J; IndIR (m=4); CombIR (m=12); CombQuad (m=12).]

BSBM scaling factor   number of descriptor objects   overall number of triples
50                    2,599                          22,616
100                   4,178                          40,133
150                   5,756                          57,524
200                   7,329                          75,062
250                   9,873                          97,613
300                   11,455                         115,217
350                   13,954                         137,567
500                   18,687                         190,502
30. Execution Time
[Figure: average execution time per query mix in seconds over pc (BSBM scaling factor, 0 to 500). Series: ARQ; NG4J; IndIR (m=4); CombIR (m=12); CombQuad (m=12).]
31. Load Time
[Figure: average creation time in seconds over pc (BSBM scaling factor, 0 to 500). Series: CombIR (m=12); IndIR (m=4); CombQuad (m=12).]
32. Execution Time
[Figure: execution time in seconds per query mix (0 to 200), with an inset for query mixes 0 to 25. Series: NG4J (SWClLib); IndIR, m=4; CombIR, m=12; CombQuadIR, m=12.]
33. These slides have been created by
Olaf Hartig
http://olafhartig.de
This work is licensed under a
Creative Commons Attribution-Share Alike 3.0 License
(http://creativecommons.org/licenses/by-sa/3.0/)