Opportunistic Linked Data Querying through Approximate Membership Metadata

Miel Vander Sande

“Solve a query for a client,  
and it will be happy for a day. 
Teach a client to SPARQL,  
and it’ll query happily ever after.”
— Confucius, 431 BC

Linked Data Fragments: a uniform view 
on publishing Linked Data
Exploring the axis: selector and metadata
Approximate Membership Metadata
Querying through Approximate
Membership Metadata
Opportunistic Querying

Interaction between client & server. 
The hunt for trade-offs: What can we learn?
[Figure: axis of interfaces offered by the server, from data dump to SPARQL endpoint.
Moving from data dump to SPARQL endpoint: server cost goes from low to high,
availability from high to low, bandwidth from high to low,
data from out-of-date to live, and client cost from high to low.]

Linked Data Fragments are 
a uniform view on Linked Data interfaces.
data 
dump
SPARQL 
endpoint
interface offered by the server
Every Linked Data interface 
offers specific fragments 
of a Linked Data set.

data: What triples does it contain?
metadata: What do we know about it?
controls: How to access more data?
Each type of Linked Data Fragment 
is defined by three characteristics.

data dump
data: all dataset triples
metadata: number of triples, file size
controls: (none)

SPARQL query result
data: triples matching the query
metadata: (none)
controls: (none)

You have to start somewhere:
Triple Pattern Fragments.
[Figure: axis of interfaces, from data dump over Linked Data documents and triple pattern fragments to SPARQL query results. Labels: low server cost, high availability, live data, high bandwidth.]
Verborgh, R., Hartig, O.,…: Querying datasets on the Web
with high availability. ISWC 2014

data (first 100)
controls (other fragments)
metadata (total count)

controls
Triple pattern fragment servers 
enable clients to be intelligent.
<http://fragments.dbpedia.org/2014/en#dataset> hydra:search [
    hydra:template "http://fragments.dbpedia.org/2014/en{?subject,predicate,object}";
    hydra:mapping
        [ hydra:variable "subject"; hydra:property rdf:subject ],
        [ hydra:variable "predicate"; hydra:property rdf:predicate ],
        [ hydra:variable "object"; hydra:property rdf:object ]
].
The RDF representation explains: 
“you can query by triple pattern”.
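A minimal Python sketch of how a client might turn the hydra:template above into a concrete fragment URL. It assumes the `{?subject,predicate,object}` part follows URI Template form-style query expansion, where undefined variables are simply left out. The function name `fragment_url` and the full IRIs for the predicate and object are illustrative assumptions, not taken from the slides.

```python
from urllib.parse import quote

def fragment_url(base, subject=None, predicate=None, object_=None):
    """Expand base{?subject,predicate,object}, omitting undefined variables."""
    params = [("subject", subject), ("predicate", predicate), ("object", object_)]
    # Percent-encode every character outside the unreserved set.
    pairs = [f"{name}={quote(value, safe='')}"
             for name, value in params if value is not None]
    return base + ("?" + "&".join(pairs) if pairs else "")

# Hypothetical request for the pattern: ?artist birthPlace Padua
url = fragment_url(
    "http://fragments.dbpedia.org/2014/en",
    predicate="http://dbpedia.org/ontology/birthPlace",
    object_="http://dbpedia.org/resource/Padua",
)
print(url)
```

Leaving all three variables undefined yields the base URL, which would correspond to the fragment matching every triple.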

The RDF representation explains: 
“this is the number of matches”.
metadata
Triple pattern fragment servers 
enable clients to be intelligent.
<#fragment> void:triples 8141.

Give them a SPARQL query. 
Give them a URL of any dataset fragment.
How can intelligent clients 
solve SPARQL queries over fragments?
They look inside the fragment 
to see how to access the dataset
and use the metadata 
to decide how to plan the query.

The client splits the query 
into the available fragments.
SELECT ?artist ?name WHERE {
    ?artist a dbpedia-owl:Artist;
            rdfs:label ?name;
            dbpedia-owl:birthPlace dbpedia:Padua.
    FILTER LANGMATCHES(LANG(?name), "EN")
}

The client gets the fragments 
and inspects their metadata.
?artist a dbpedia-owl:Artist.
first 100 triples
96,000
?artist rdfs:label ?name.
first 100 triples
12,000,000
?artist dbont:birthPlace dbpedia:Padua.
first 100 triples
135

?artist a dbpedia-owl:Artist. 96,000
?artist rdfs:label ?name. 12,000,000
?artist dbont:birthPlace dbpedia:Padua. 135
dbpedia:Alberto_Benettin dbont:birthPlace dbpedia:Padua.
dbpedia:Alberto_Bigon dbont:birthPlace dbpedia:Padua.
The metadata enables the client
to choose the right starting point.
dbp:Alberto_Benettin a dbont:Artist.
dbp:Alberto_Benettin rdfs:label ?name.
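The selection step can be sketched in a few lines of Python: pick the pattern with the smallest count, then bind its results into the remaining patterns. Patterns are kept as plain strings here for brevity, and the helper names `choose_start` and `bind` are hypothetical; a real client would work on parsed triple patterns.

```python
def choose_start(counts):
    """Return the triple pattern with the fewest matches."""
    return min(counts, key=counts.get)

def bind(pattern, variable, value):
    """Substitute a variable in a (string) triple pattern by a concrete term."""
    return pattern.replace(variable, value)

# Counts taken from the fragment metadata shown on the slide.
counts = {
    "?artist a dbpedia-owl:Artist.": 96_000,
    "?artist rdfs:label ?name.": 12_000_000,
    "?artist dbont:birthPlace dbpedia:Padua.": 135,
}

start = choose_start(counts)
remaining = [p for p in counts if p != start]

# Each binding of ?artist from the start fragment yields new, more specific patterns.
bound = [bind(p, "?artist", "dbpedia:Alberto_Benettin") for p in remaining]
print(start)
print(bound)
```

The first bound pattern has no variables left, which makes it exactly the kind of "is this triple in the dataset?" request discussed next.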

For some patterns, many requests are
of type “is this triple in the dataset?”
[Figure: fraction of membership queries (0% to 100%) for 20 WatDiv queries:
linear (L1 to L5), star (S1 to S7), snowflake-shaped (F1 to F5) and complex (C1 to C3).]

Advancing in selector and/or metadata
dimensions.
[Figure: Triple Pattern Fragments (low server cost, high availability, live data, high bandwidth) placed on two axes.
selector: from simple questions to complex questions.
metadata: from no information for the client to extensive useful information for the client.]

selector: Substring search
J. Van Herwegen et al.:
Substring Filtering for Low-Cost
Linked Data Interfaces
Last talk of this session!

selector: Substring search
metadata: Approximate Membership Function (AMF)

Extend the TPF response with a compact
representation of all possible mappings.
metadata
Approximate Membership Function (AMF)
Approximate set membership assessment
with a predefined false positive probability.
Bloom filter / Golomb-coded set
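To make the Bloom filter option concrete, here is a minimal Python sketch of an approximate membership function. It is not the implementation behind these slides: the class layout is an assumption, and SHA-256 with double hashing stands in for the hash functions a real deployment would choose.

```python
import hashlib

class BloomFilter:
    """Bit array of m bits; each item sets k positions."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        # Stand-in hashing: derive k positions from one SHA-256 digest
        # via double hashing (h1 + i * h2) mod m.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd, so never zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

members = ["dbpedia:Alberto_Benettin", "dbpedia:Alberto_Bigon"]
amf = BloomFilter(m=1948, k=10)  # illustrative sizes only
for member in members:
    amf.add(member)
print(all(member in amf for member in members))
```

A negative answer is definitive, while a positive answer is only probably right; that asymmetry is what lets a client skip requests safely.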

“Can we reduce the number of HTTP requests?”
“Can we reduce the total execution time?”
“What is the overhead on server CPU load?”

Bloom filter
[Figure: a bit array of size m, initially all zeros. Each member (n0 dbpedia:Alberto_Benettin, n1 dbpedia:Alberto_Bigon, …, nx) is hashed by the hash functions k0, k1, …, kx, and the corresponding bits are set to 1.]
Golomb-coded set (GCS)
[Figure: each member (n0 dbpedia:Alberto_Benettin, n1 dbpedia:Alberto_Bigon) is hashed by a single hash function k into a sparse bit array. The gaps between set bits follow a geometric distribution and are Golomb coded.]
“this BloomFilter with false positive
probability X and hash function Y
represents the presence of all bindings for ?s”.
metadata
Server enables clients to avoid  
membership requests.
<#fragment> void:triples 96300. # existing count metadata
_:membershipFunction a ms:BloomFilter; # AMF metadata
    ms:hashSize 524288;
    ms:hashFunction <MyMurmur1>, <MyMurmur2>;
    ms:memberCollection [
        ms:sourceCollection <#fragment>;
        ms:projectedProperty rdf:subject
    ];
    ms:falsePositiveRate 0.05;
    ms:falseNegativeRate 0.0;
    ms:binaryRepresentation "QmF...ZTY"^^xsd:base64Binary.

GET ?artist dbont:birthPlace dbpedia:Padua.
dbpedia:Alberto_Benettin dbont:birthPlace dbpedia:Padua.
135
…
Client filters non-members locally  
with one extra (cached) request
GET dbpedia:Alberto_Benettin a dbont:Artist. 0
GET dbpedia:Alberto_Bigon a dbont:Artist. 1
GET dbpedia:Alberto_Da_Zara a dbont:Artist. 1
GET dbpedia:Alberto_Gallo a dbont:Artist. 0
GET dbpedia:Alberto_Bigon a dbont:Artist. 1
GET ?artist a dbont:Artist.
Approx. Membership Filter
GET …
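The local filtering step might look like the following Python sketch. The filter is passed in as a `might_contain` callable so that any approximate membership function can be plugged in; `request_membership` stands for the HTTP membership request. Both names, and the sample data about which candidates are artists, are invented for illustration only.

```python
def confirm_members(candidates, might_contain, request_membership):
    """Drop definite non-members locally, then confirm the rest remotely."""
    confirmed, requests_sent = [], 0
    for candidate in candidates:
        if not might_contain(candidate):
            continue  # filter says 0: certainly not a member, no request needed
        requests_sent += 1  # filter says 1: possible member, ask the server
        if request_membership(candidate):
            confirmed.append(candidate)
    return confirmed, requests_sent

candidates = [
    "dbpedia:Alberto_Benettin",
    "dbpedia:Alberto_Bigon",
    "dbpedia:Alberto_Da_Zara",
    "dbpedia:Alberto_Gallo",
]
# Stand-ins for the filter and the server (hypothetical data).
filter_positives = {"dbpedia:Alberto_Bigon", "dbpedia:Alberto_Da_Zara"}
actual_artists = {"dbpedia:Alberto_Bigon"}

confirmed, requests_sent = confirm_members(
    candidates, filter_positives.__contains__, actual_artists.__contains__
)
print(confirmed, requests_sent)
```

In this made-up run, four candidate requests shrink to two, one of which turns out to be a false positive.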

We evaluated for request count, server
cost and speedup in a Web setting.
BloomFilter: MurmurHash3, GCS: FNV-1
1 HTTP Cache with 1 Mbps
p = 1/1024 (≈ 0.1%), 1/128 (≈ 0.8%), 1/64 (≈ 1.6%)
250 queries from 125 diverse WatDiv
templates on Amazon EC2 machine
WatDiv 100M triples dataset
Timeout: 3min

vs. vanilla TPF server & client
2 client algorithms:
Original “greedy” algorithm
Optimized join-tree algorithm*
* Van Herwegen et al.: Query Execution Optimization for
Clients of Triple Pattern Fragments. ESWC 2015

> 50% of the queries have fewer requests,
< 20% have more requests.
[Figure: percentage of queries (p = 1/1024) with equal, fewer, or more requests, for Greedy Bloom, Greedy GCS, Optimized Bloom and Optimized GCS. Across the four configurations: fewer requests for 49% to 62% of the queries, equal for 32% to 35%, more requests for 5% to 18%.]

Queries with relatively many HTTP requests
(45,000+ / query) benefit greatly
[Figure: difference in number of requests (axis 0 to 16,000), split into fewer requests and more requests, for Greedy Bloom, Greedy GCS, Optimized Bloom and Optimized GCS. Annotation: < 35.]

No queries show a reduction in execution
time; a third even show an increase.
[Figure: percentage of queries (p = 1/1024) with equal, lower, or higher execution time, for Greedy Bloom, Greedy GCS, Optimized Bloom and Optimized GCS. Across the four configurations: lower execution time for 0% of the queries, higher for 16% to 38%, equal for 62% to 84%.]

Server remains low-cost, as impact is  
very acceptable (< 6%).
[Figure: server CPU load (%, axis 0 to 30) for Original, Bloom (1/1024), Bloom (1/128), Bloom (1/64), GCS (1/1024), GCS (1/128) and GCS (1/64). Bar values lie between 9.2 and 14.9.]

During execution, a result candidate
could already be correct (with probability 1 - p).
Can we be opportunistic here, and
temporarily allow imprecise results?
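One way to picture this idea is the Python sketch below: every candidate that passes the membership filter is emitted immediately as an uncertain result, and false positives are revoked once the confirming requests come back. The event names and the `might_contain` and `request_membership` callables are hypothetical, and the sample data is invented.

```python
def opportunistic(candidates, might_contain, request_membership):
    """Yield ("emit", x) early, then ("revoke", x) for each false positive."""
    emitted = [c for c in candidates if might_contain(c)]
    # Phase 1: emit everything the filter accepts. Since the filter has no
    # false negatives, recall is 100% once this loop ends.
    for candidate in emitted:
        yield ("emit", candidate)
    # Phase 2: confirm remotely and revoke what was wrong, restoring
    # 100% precision.
    for candidate in emitted:
        if not request_membership(candidate):
            yield ("revoke", candidate)

filter_positives = {"a", "b"}  # stand-in filter: "b" is a false positive
actual_members = {"a"}         # stand-in for the server's answers

events = list(opportunistic(
    ["a", "b", "c"], filter_positives.__contains__, actual_members.__contains__
))
print(events)
```

A user interface consuming these events could show results after phase 1 and quietly remove the few revoked ones later.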

“Can we reduce the time to 100% recall?”
[Figure: two execution timelines, running from start of execution (0% recall) to the end (100% recall, 100% precision).
Only allow certain results: start execution, 1st result computed, n < r results computed, r results computed.
Temporarily allow uncertain results: start execution, 1st result computed, n < r results computed, r + f results computed (100% recall), r results computed.]
Fig. 2. This SPARQL query execution timeline compares regular and opportunistic
query execution, assuming r total query results and f false positives. Note how
both approaches achieve 100% recall and precision at a shared point in the end, but
there exists a period during which only opportunistic execution reaches 100% recall
(shaded).

Temporarily allowing < 100% precision
can reduce the time to 100% recall by 1/3.
[Figure: execution time (s, axis 0 to 140) until 100% recall and until 100% precision, for Greedy + Bloom (p = 1/1024).]
Number of revoked results was 0 or 1.

Approximate Membership Metadata
is a nuanced debate
For some query types, bandwidth decreases considerably
during TPF query execution.
For larger fragments, real-time computation
hurts execution time. We expect gains from
pre-caching and out-of-band delivery.
Opportunistic querying is a promising direction
for further exploration.

No one size fits all, explore the axis. 
Find metrics that fit your use-case.
Client & Server load 
Request & Response size 
Protocol (HTTP) impact 
…
Try your own trade-off
server at our demo (and
get a nice cup of coffee).
Start serving Linked Data like a barista
