Overview of how data on the Web of Data can be consumed (first and foremost Linked Data) and implications for the development of usage mining approaches.
References:
Elbedweihy, K., Mazumdar, S., Cano, A. E., Wrigley, S. N., & Ciravegna, F. (2011). Identifying Information Needs by Modelling Collective Query Patterns. COLD, 782.
Elbedweihy, K., Wrigley, S. N., & Ciravegna, F. (2012). Improving Semantic Search Using Query Log Analysis. Interacting with Linked Data (ILD 2012), 61.
Raghuveer, A. (2012). Characterizing machine agent behavior through SPARQL query mining. In Proceedings of the International Workshop on Usage Analysis and the Web of Data, Lyon, France.
Arias, M., Fernández, J. D., Martínez-Prieto, M. A., & de la Fuente, P. (2011). An empirical study of real-world SPARQL queries. arXiv preprint arXiv:1103.5043.
Hartig, O., Bizer, C., & Freytag, J. C. (2009). Executing SPARQL queries over the web of linked data (pp. 293-309). Springer Berlin Heidelberg.
Verborgh, R., Hartig, O., De Meester, B., Haesendonck, G., De Vocht, L., Vander Sande, M., ... & Van de Walle, R. (2014). Querying datasets on the web with high availability. In The Semantic Web–ISWC 2014 (pp. 180-196). Springer International Publishing.
Verborgh, R., Vander Sande, M., Colpaert, P., Coppens, S., Mannens, E., & Van de Walle, R. (2014, April). Web-Scale Querying through Linked Data Fragments. In LDOW.
Luczak-Rösch, M., & Bischoff, M. (2011). Statistical analysis of web of data usage. In Joint Workshop on Knowledge Evolution and Ontology Dynamics (EvoDyn2011), CEUR WS.
Luczak-Rösch, M. (2014). Usage-dependent maintenance of structured Web data sets (Doctoral dissertation, Freie Universität Berlin, Germany), http://edocs.fu-berlin.de/diss/receive/FUDISS_thesis_000000096138.
1. Web of Data Usage Mining
Markus Luczak-Roesch
@mluczak | http://markus-luczak.de
2. What you should learn:
• describe the architectural differences between content negotiation and Linked Data queries;
• develop applications that use different strategies to consume Linked Data;
• develop usage mining methods that exploit the atomic parts of the SPARQL query language.
3. Linked Data principles
1. Use URIs as names for “Things” (resources).
2. Use HTTP URIs to allow the access to resources on the Web.
3. On resource access, deliver meaningful information conforming to Web standards (RDF, SPARQL).
4. Set RDF links to resources published by other parties to allow the discovery of more resources.
Source: http://www.w3.org/DesignIssues/LinkedData.html
RDF link (S P O): yago-res:Berlin owl:sameAs dbpedia:Berlin
Content Negotiation:
http://dbpedia.org/resource/Berlin
http://dbpedia.org/page/Berlin
http://dbpedia.org/data/Berlin
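The relationship between the three DBpedia URLs can be sketched in a few lines. This is a minimal, offline illustration that assumes the common pattern in which the /resource/ URI is redirected to /page/ for HTML clients and to /data/ for RDF clients. The function name negotiate and the list of RDF media types are assumptions made for this example, not part of the slides.

```python
# Media types treated as "the client wants RDF" (an illustrative selection).
RDF_MEDIA_TYPES = ("application/rdf+xml", "text/turtle", "application/ld+json")


def negotiate(resource_uri: str, accept: str) -> str:
    """Return the document URL a server could redirect to for this Accept header."""
    base, name = resource_uri.rsplit("/resource/", 1)
    if any(media_type in accept for media_type in RDF_MEDIA_TYPES):
        return f"{base}/data/{name}"   # machine-readable description
    return f"{base}/page/{name}"       # human-readable HTML page


html_target = negotiate("http://dbpedia.org/resource/Berlin", "text/html")
rdf_target = negotiate("http://dbpedia.org/resource/Berlin", "text/turtle")
print(html_target)
print(rdf_target)
```

A real server would also honour quality values (q=...) in the Accept header; the sketch only checks whether an RDF media type is mentioned at all.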
19. Consuming Linked Data
• stateless
• request-response
[Figure: sequence diagram over time t between Client and Server: open connection, request, response, close connection (TCP life cycle). Derived from R. Tolksdorf.]
20. Consuming Linked Data
Client (request):
GET / HTTP/1.1
User-Agent: Mozilla/5.0 … Firefox/10.0.3
Host: markus-luczak.de:80
Accept: */*
Server (response):
HTTP/1.1 200 OK
Server: Apache/2.0.49
Content-Language: en
Content-Type: text/html
Content-length: 2990
<!DOCTYPE html>
<html xml:lang="en"
…
derived from R. Tolksdorf
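To make the structure of the exchange above concrete, the following sketch splits a raw HTTP response into status line, headers, and body. It is a simplified illustration, not a complete HTTP parser. The function name parse_response is an assumption for this example, and the sample message is the response from the slide with its body shortened.

```python
from typing import Dict, Tuple


def parse_response(raw: str) -> Tuple[int, str, Dict[str, str], str]:
    """Split a raw HTTP/1.1 response into (status, reason, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")      # a blank line ends the header section
    lines = head.split("\r\n")
    _version, status, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return int(status), reason, headers, body


SAMPLE = (
    "HTTP/1.1 200 OK\r\n"
    "Server: Apache/2.0.49\r\n"
    "Content-Language: en\r\n"
    "Content-Type: text/html\r\n"
    "Content-length: 2990\r\n"
    "\r\n"
    "<!DOCTYPE html>"
)

status, reason, headers, body = parse_response(SAMPLE)
print(status, reason, headers["content-type"])
```

For usage mining, the same fields (status code, content type, size) are what typically ends up in a server log.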
22. Consuming Linked Data
• Discover URIs
– Lookup services
• http://rkbexplorer.com
– Web of Data search engines
• http://sindice.com
• http://ws.nju.edu.cn/falcons/objectsearch/index.jsp
23. Consuming Linked Data
• Discover additional data for the resource at hand
• follow links ("follow your nose")
– rdfs:seeAlso
– owl:sameAs
• Co-Reference services
– http://sameas.org
• Web of Data search engines
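The "follow your nose" idea can be sketched as a breadth-first traversal over owl:sameAs and rdfs:seeAlso links. To keep the example self-contained, dereferencing a URI is simulated with an in-memory dictionary. The WEB dictionary, the function names, and the example:BerlinGuide resource are hypothetical and only serve the illustration.

```python
from collections import deque

LINK_PREDICATES = {"owl:sameAs", "rdfs:seeAlso"}

# Hypothetical stand-in for the Web: URI -> triples returned when it is dereferenced.
WEB = {
    "yago-res:Berlin": [("yago-res:Berlin", "owl:sameAs", "dbpedia:Berlin")],
    "dbpedia:Berlin": [
        ("dbpedia:Berlin", "rdfs:label", "Berlin"),
        ("dbpedia:Berlin", "rdfs:seeAlso", "example:BerlinGuide"),
    ],
    "example:BerlinGuide": [],
}


def dereference(uri):
    """Simulated HTTP lookup; a real client would send a GET request here."""
    return WEB.get(uri, [])


def follow_your_nose(start, max_hops=2):
    """Collect triples reachable from start via sameAs/seeAlso within max_hops."""
    seen = {start}
    queue = deque([(start, 0)])
    collected = []
    while queue:
        uri, hops = queue.popleft()
        triples = dereference(uri)
        collected.extend(triples)
        if hops == max_hops:
            continue  # do not expand links beyond the hop limit
        for _s, p, o in triples:
            if p in LINK_PREDICATES and o not in seen:
                seen.add(o)
                queue.append((o, hops + 1))
    return seen, collected


visited, triples = follow_your_nose("yago-res:Berlin")
print(sorted(visited))
```

Each simulated lookup corresponds to one request that the publishing server could log, which is why this consumption strategy leaves a trace of individual resource accesses.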
28. SPARQL queries on the Web
• RESTful service endpoint
GET /sparql?query=PREFIX+rdf… HTTP/1.1
Host: dbpedia.org
Result formats: http://www.w3.org/TR/rdf-sparql-XMLres/ and http://www.w3.org/TR/rdf-sparql-json-res/
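As a small illustration of the request shown above, the following sketch builds the URL for a SPARQL query sent via HTTP GET. The endpoint URL follows the slide; the helper name sparql_get_url and the example query are assumptions. No request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def sparql_get_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query as the 'query' parameter of a GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


QUERY = (
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> "
    "SELECT ?type WHERE { <http://dbpedia.org/resource/Berlin> rdf:type ?type } LIMIT 5"
)

url = sparql_get_url("http://dbpedia.org/sparql", QUERY)

# A client would typically ask for one of the result formats via the Accept header.
headers = {"Accept": "application/sparql-results+json"}

# The server (or a log analysis) can recover the full query text from the URL.
decoded_query = parse_qs(urlsplit(url).query)["query"][0]
print(url)
```

Because the complete query is part of the request line, it is usually visible in the server's access log, which is what makes SPARQL endpoints a rich source for usage mining.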
30. Querying Linked Data
• Distribution of data creates challenges for querying it
• Query approaches
– follow-up queries ← application-dependent, proprietary
– query a central data repository (e.g. LOD cache) ← trivial
– federated queries ← more interesting
• idea: query a mediator that distributes the sub-queries and returns the aggregated result (as of SPARQL 1.1)
– link traversal ← very interesting
• idea: follow links in the results retrieved from a source to expand the data dynamically
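The mediator idea behind federated queries can be sketched as follows. This is a strongly simplified illustration: the "sources" are in-memory lists instead of remote endpoints, a sub-query is a single triple pattern with None as wildcard, and aggregation is a plain union. All names and the sample data are assumptions for this example.

```python
# Hypothetical data sources; in practice these would be remote SPARQL endpoints.
SOURCES = {
    "source_a": [
        ("dbpedia:Berlin", "rdf:type", "dbo:City"),
        ("dbpedia:Berlin", "dbo:country", "dbpedia:Germany"),
    ],
    "source_b": [
        ("yago-res:Berlin", "owl:sameAs", "dbpedia:Berlin"),
        ("dbpedia:Paris", "rdf:type", "dbo:City"),
    ],
}


def query_source(triples, pattern):
    """Evaluate one triple pattern against one source (None matches anything)."""
    return [
        t for t in triples
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]


def mediator(pattern):
    """Send the sub-query to every source and return the aggregated result."""
    aggregated = []
    for name, triples in SOURCES.items():
        for triple in query_source(triples, pattern):
            aggregated.append((name, triple))
    return aggregated


cities = mediator((None, "rdf:type", "dbo:City"))
print(cities)
```

From a usage mining perspective, each source in such a setup only sees the sub-queries routed to it, not the original query.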
41. Statistical analysis
[Figure: three panels (a) SWC, (b) DBpedia, (c) LGD]
Figure 20: Usage of the concepts of the multi-ontologies (edges are hidden)
Source of the figure: own illustration
[…] of these datasets still has great potential for improvement. For example, it is possible to
use a higher number of specific concepts. Likewise, more concepts from areas other than
persons, places and […] can theoretically […]
Source: Master's thesis of Markus Bischoff
42. Estimating the effects of change
[…] to be added to the DBpedia 3.4 data set conforming to our approach (footnote 16).
Table 7.14: Recommended predicates to be added to the data set and the estimated effects of change.
Primitive to add      Effects of change   Exists in data set
dbp:manufacturer      0.004505372         x
dbp:firstFlight       0.004505372         x
dbp:introduced        0.004505372         x
dbp:nationalOrigin    0.004505372
dbo:thumbnail         0.021986718         x
dbo:director          0.025047524
dbp:director          0.02503915          x
dbp:abstract          0.025797024         x
dbo:starring          0.034066643
dbp:starring          0.034066643         x
dbp:stars             0.034066643         x
skos:Concept          0.040946128         x
skos:broader          0.04116386          x
dbp:redirect          0.066441677         x
47. SPARQL
Excerpt (7. Evaluation):
The visualization shows how primitives on the left hand side (LHS) of a rule imply
particular ones on the right hand side (RHS) and which likelihood such an association
has. In our specific case this allows us to analyze which primitives are queried
together frequently in failing queries. We spot two characteristic usage patterns: (1)
the properties and classes queried in the context of http://dbpedia.org/ontology/Aircraft;
(2) the properties and classes queried in the context of an object variable.
These can be further analyzed by exporting the association rules to GraphML and
visualizing the network with a network visualization and analysis tool such as Gephi
(http://gephi.org/). Figure 7.13 depicts one filtered network representation for our example
case. Nodes with a degree lower than 5 are filtered out (k-core network with k = 5)
to derive a well-arranged visualization of the most important primitives in failing
queries. Nodes represent LHS and RHS of the computed rules. Edges point from the
LHS to the RHS of the particular rules.
Figure 7.13: Filtered visualization of the association rule network (k-core 5 filter
applied to reduce nodes with degree lower than 5).
Table 7.14 lists an exemplary set of primitives which would be recommended […]
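The notion of association rules between primitives in failing queries can be illustrated with a small sketch. It computes support and confidence for rules with a single primitive on each side. This is a simplification chosen for brevity and not necessarily the mining procedure used in the thesis. The transactions (sets of primitives per failing query) are invented for the example.

```python
from itertools import permutations

# Hypothetical input: the primitives used in each failing query.
FAILING_QUERIES = [
    {"dbo:Aircraft", "dbp:manufacturer"},
    {"dbo:Aircraft", "dbp:manufacturer", "dbp:firstFlight"},
    {"dbo:Aircraft", "dbp:firstFlight"},
    {"dbo:director"},
]


def single_item_rules(transactions, min_support, min_confidence):
    """Return (lhs, rhs, support, confidence) for all rules lhs -> rhs above the thresholds."""
    total = len(transactions)
    items = sorted({item for t in transactions for item in t})
    rules = []
    for lhs, rhs in permutations(items, 2):
        lhs_count = sum(1 for t in transactions if lhs in t)
        both_count = sum(1 for t in transactions if lhs in t and rhs in t)
        support = both_count / total          # share of queries containing both primitives
        confidence = both_count / lhs_count   # share of lhs-queries that also contain rhs
        if support >= min_support and confidence >= min_confidence:
            rules.append((lhs, rhs, support, confidence))
    return rules


rules = single_item_rules(FAILING_QUERIES, min_support=0.5, min_confidence=0.9)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs} (support={support:.2f}, confidence={confidence:.2f})")
```

Each resulting rule corresponds to one directed edge (LHS to RHS) in a rule network of the kind described in the excerpt.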
Query pattern:
{
?s1 foaf:name "Markus Luczak-Roesch".
?s1 rdf:type dbp:Person
}
Dataset:
<http://markus-luczak.de#me> foaf:name "Markus Luczak-Roesch" .
<http://markus-luczak.de#me> rdf:type foaf:Person .
Marks shown on the slide for the query applied to the dataset: ✔ ✗ ✗
The server can trace detailed usage.
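What tracing at the level of atomic query parts could look like is sketched below for the example above. The code checks each triple pattern of the query on its own and lists the primitives (constants) of the query that do not occur in the dataset. Function names are assumptions, and treating every term that does not start with "?" as a constant is a simplification.

```python
DATASET = {
    ("http://markus-luczak.de#me", "foaf:name", "Markus Luczak-Roesch"),
    ("http://markus-luczak.de#me", "rdf:type", "foaf:Person"),
}

QUERY = [
    ("?s1", "foaf:name", "Markus Luczak-Roesch"),
    ("?s1", "rdf:type", "dbp:Person"),
]


def is_variable(term):
    return term.startswith("?")


def pattern_has_match(pattern, dataset):
    """True if at least one triple matches the pattern (variables act as wildcards)."""
    return any(
        all(is_variable(p) or p == v for p, v in zip(pattern, triple))
        for triple in dataset
    )


def trace(query, dataset):
    """Per-pattern success flags, in query order."""
    return [pattern_has_match(pattern, dataset) for pattern in query]


def unknown_primitives(query, dataset):
    """Constants used in the query that never occur in the dataset."""
    known = {term for triple in dataset for term in triple}
    used = {term for pattern in query for term in pattern if not is_variable(term)}
    return sorted(used - known)


flags = trace(QUERY, DATASET)
missing = unknown_primitives(QUERY, DATASET)
print(flags, missing)
```

In this example the name pattern succeeds, the type pattern fails, and dbp:Person is reported as a primitive the dataset does not know, which is the kind of signal a usage-dependent maintenance approach can build on.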
48. Linked Data Fragments
Excerpt from: Querying Datasets on the Web with High Availability
[Figure: axis ranging from "generic requests, high client effort, high server availability" to "specific requests, high server effort, low server availability". Interfaces shown: data dump, Linked Data document, SPARQL result; triple pattern fragments appear among the various types of Linked Data Fragments.]
Fig. 1: All HTTP triple interfaces offer Linked Data Fragments of a dataset. They differ
in the specificity of the data they contain, and thus the effort needed to create them.
3.2 Formal definitions
As a basis for our formalization, we use the following concepts of the RDF data
model [16] and the SPARQL query language [12]. We write $U$, $B$, $L$, and $V$ to
denote the sets of all URIs, blank nodes, literals, and variables, respectively.
Then, $T = (U \cup B) \times U \times (U \cup B \cup L)$ is the (infinite) set of all RDF triples. Any
tuple $tp \in (U \cup V) \times (U \cup V) \times (U \cup L \cup V)$ is a triple pattern. Any finite set of
such triple patterns is a basic graph pattern (BGP). Any more complex SPARQL
graph pattern, typically denoted by $P$, combines triple patterns (or BGPs) using
specific operators [12,20]. The standard (set-based) query semantics for SPARQL
defines the query result of such a graph pattern $P$ over a set of RDF triples
$G \subseteq T$ as a set that we denote by $[[P]]_G$ and that consists of partial mappings
$\mu : V \to (U \cup B \cup L)$, which are called solution mappings. An RDF triple $t$ is
a matching triple for a triple pattern $tp$ if there exists a solution mapping $\mu$
such that $t = \mu[tp]$, where $\mu[tp]$ denotes the triple (pattern) that we obtain by
replacing the variables in $tp$ according to $\mu$.
For the sake of a more straightforward formalization, in this paper, we assume
without loss of generality that every dataset $G$ published via some kind of
fragments on the Web is a finite set of blank-node-free RDF triples; i.e., $G \subseteq T^*$
where $T^* = U \times U \times (U \cup L)$. Each fragment of such a dataset contains triples
that somehow belong together; they have been selected based on some condition,
which we abstract through the notion of a selector:
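The definitions of solution mapping and matching triple can be made concrete with a short sketch. Triples and triple patterns are modelled as 3-tuples of strings, variables start with "?", and a solution mapping is a dictionary. These representation choices and the function names are assumptions for illustration only.

```python
def is_var(term):
    return term.startswith("?")


def match(tp, t):
    """Return a solution mapping mu with mu[tp] == t, or None if t does not match tp."""
    mu = {}
    for pattern_term, value in zip(tp, t):
        if is_var(pattern_term):
            # A variable that occurs twice must be bound to the same value.
            if mu.setdefault(pattern_term, value) != value:
                return None
        elif pattern_term != value:
            return None
    return mu


def apply_mapping(mu, tp):
    """Compute mu[tp]: replace the variables in tp according to mu."""
    return tuple(mu.get(term, term) for term in tp)


def evaluate(tp, graph):
    """All solution mappings of a single triple pattern over a set of triples G."""
    return [mu for t in sorted(graph) if (mu := match(tp, t)) is not None]


G = {
    ("dbpedia:Austin", "rdf:type", "dbo:City"),
    ("dbpedia:Berlin", "rdf:type", "dbo:City"),
    ("dbpedia:Berlin", "dbo:country", "dbpedia:Germany"),
}

tp = ("?s", "rdf:type", "?o")
solutions = evaluate(tp, G)
print(solutions)
```

For every returned mapping mu, apply_mapping(mu, tp) reproduces a triple of G, which is exactly the condition $t = \mu[tp]$ from the definition.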
xxx.xxx.xxx.xxx - - [17/Oct/2014:07:43:02 +0000]
"GET /2014/en?subject=&predicate=&object=dbpedia%3AAustin HTTP/1.1" 200
1309 "http://fragments.dbpedia.org/2014/en" …
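A log line like the one above already contains the requested triple pattern. The following sketch extracts it with the standard library. The regular expression covers only the parts of the line that are needed here, the trailing part of the log entry (elided on the slide) is left out, and the function name is an assumption.

```python
import re
from urllib.parse import parse_qs, urlsplit

LOG_LINE = (
    'xxx.xxx.xxx.xxx - - [17/Oct/2014:07:43:02 +0000] '
    '"GET /2014/en?subject=&predicate=&object=dbpedia%3AAustin HTTP/1.1" 200 1309'
)

REQUEST_RE = re.compile(
    r'\[(?P<time>[^\]]+)\] "(?P<method>[A-Z]+) (?P<target>\S+) HTTP/[\d.]+" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)


def extract_triple_pattern(line):
    """Return (subject, predicate, object, status); empty strings stand for variables."""
    found = REQUEST_RE.search(line)
    if found is None:
        return None
    # keep_blank_values keeps 'subject=' and 'predicate=' as empty strings.
    params = parse_qs(urlsplit(found.group("target")).query, keep_blank_values=True)
    subject = params.get("subject", [""])[0]
    predicate = params.get("predicate", [""])[0]
    obj = params.get("object", [""])[0]
    return subject, predicate, obj, int(found.group("status"))


pattern = extract_triple_pattern(LOG_LINE)
print(pattern)
```

Compared with a SPARQL endpoint log, the server here only sees individual triple patterns, so reconstructing the original query requires grouping requests per client and time window.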
[…] fetches the first page of the corresponding LDF. This page contains the cnt metadata,
which tells us how many matches the dataset has for each triple pattern.
The pattern is then decomposed by evaluating it using a) a triple pattern iterator
for the triple pattern with the smallest number of matches, and b) a new
BGP iterator for the remainder of the pattern. This results in a dynamic pipeline
for each of the mappings of its predecessor, as visualized in Fig. 2. Each pipeline
is optimized locally for a specific mapping, reducing the number of requests.
To evaluate a SPARQL query over a triple pattern fragment collection, we proceed
as follows. For each BGP of the query, a BGP iterator is created. Dedicated
iterators are necessary for other SPARQL constructs such as UNION and OPTIONAL,
but their implementation need not be LDF-specific; they can reuse the triple
pattern fragment BGP iterators. The predecessor of the first iterator is a start
iterator. We continuously pull solution mappings from the last iterator in the
pipeline and output them as solutions of the query, until the last iterator responds
with nil. This pull-based process is able to deliver results incrementally.
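The decomposition described above can be sketched as a recursive generator. This is a simplified, in-memory illustration under stated assumptions: the function fragment stands in for requesting a triple pattern fragment, the length of its result plays the role of the count metadata, and the sample data is invented. It is not the client implementation from the paper.

```python
# Hypothetical dataset standing in for a triple pattern fragments server.
DATA = {
    ("Drago_Ibler", "a", "Architect"),
    ("Drago_Ibler", "birthPlace", "Zagreb"),
    ("Juraj_Neidhardt", "a", "Architect"),
    ("Juraj_Neidhardt", "birthPlace", "Zagreb"),
    ("Alen_Peternac", "birthPlace", "Zagreb"),
    ("Zagreb", "subject", "Capitals_in_Europe"),
    ("Budapest", "subject", "Capitals_in_Europe"),
    ("Rome", "subject", "Capitals_in_Europe"),
}


def is_var(term):
    return term.startswith("?")


def fragment(tp):
    """Matching triples for one pattern (assumes no variable repeats within a pattern)."""
    return [t for t in DATA if all(is_var(p) or p == v for p, v in zip(tp, t))]


def bind(tp, mu):
    """Replace variables in tp that are bound in mu."""
    return tuple(mu.get(term, term) for term in tp)


def bgp_iterator(bgp, mu=None):
    """Yield solution mappings, choosing the cheapest pattern anew at every step."""
    mu = mu or {}
    if not bgp:
        yield dict(mu)
        return
    # Pick the pattern with the smallest number of matches (the "count" metadata).
    best_index = min(range(len(bgp)), key=lambda i: len(fragment(bgp[i])))
    best = bgp[best_index]
    rest = bgp[:best_index] + bgp[best_index + 1:]
    for triple in fragment(best):
        step = {term: value for term, value in zip(best, triple) if is_var(term)}
        # A new iterator is created for the remaining, partially bound pattern.
        yield from bgp_iterator([bind(tp, step) for tp in rest], {**mu, **step})


B = [
    ("?person", "a", "Architect"),
    ("?person", "birthPlace", "?city"),
    ("?city", "subject", "Capitals_in_Europe"),
]

results = list(bgp_iterator(B))
persons = sorted(r["?person"] for r in results)
print(persons)
```

Because the generator is pulled lazily, the first solutions are available before the whole search has finished, which mirrors the incremental, pull-based process described in the excerpt.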
[Figure: dynamic iterator pipeline for the example query]
B = { ?person a Architect. ?person birthPlace ?city. ?city subject Capitals_in_Europe. }
Triple pattern iterator for "?city subject Capitals_in_Europe." yields: Zagreb, Budapest, Rome, ...
B' = { ?person a Architect. ?person birthPlace Zagreb. }
Triple pattern iterator for "?person birthPlace Zagreb." yields: ..., Alen_Peternac, Drago_Ibler, Juraj_Neidhardt, ...
B'' = { Drago_Ibler a Architect. }
Fig. 2: A BGP iterator decomposes a BGP $B = \{tp_1, \ldots, tp_n\}$ into a triple pattern
iterator for an optimal $tp_i$ and, for each resulting solution mapping $\mu$ of $tp_i$, creates
a BGP iterator for the remaining pattern $B' = \{tp \mid tp = \mu[tp_j] \wedge tp_j \in B\} \setminus \{\mu[tp_i]\}$.
Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014).
The final publication is available at link.springer.com.
4.2 Dynamic iterator pipelines
A common approach to implement query execution in database sy […]
iterators that are typically arranged in a tree or a pipeline, based […]
results are computed recursively [10]. Such a pipelined approac […]
studied for Linked Data query processing [13,15]. In order to en […]
results and allow the straightforward addition of SPARQL oper […]
ment a triple pattern fragments client using iterators.
The previous algorithm, however, cannot be implemented by […]
pipeline. For instance, consider a query for architects born in Eu […]
SELECT ?person ?city WHERE {
?person a dbpedia-owl:Architect. # tp1
?person dbpprop:birthPlace ?city. # tp2
?city dc:subject dbpedia:Capitals_in_Europe. # tp3
} LIMIT 100
Suppose the pipeline begins by finding ?city mappings for tp […]
to choose whether it will next consider tp1 or tp2. The optimal […]
differs depending on the value of ?city:
– For dbpedia:Paris, there are ±1,900 matches for tp2, and […]
for tp1, so there will be less HTTP requests if we continue w […]
– For dbpedia:Vilnius, there are 164 matches for tp2, and ±1 […]
tp1, so there will be less HTTP requests if we continue with […]
With a static pipeline, we would have to choose the pipeline stru […]
and subsequently reuse it.
In order to generate an optimized pipeline for each (sub-)qu […]
a divide-and-conquer strategy in which a query is decomposed d […]
49. Wikidata
• API access to
– items
– edit history
– items’ discussions
– items’ access statistics
– and more
• Linked Data interface
• MediaWiki API
• Wikidata Query
• SPARQL
• Linked Data Fragments
Access to more than “just” usage.
50. Thank you very much!
@mluczak | http://markus-luczak.de
Image: http://www.flickr.com/photos/therichbrooks/4040197666/, CC-BY 2.0, https://creativecommons.or