History and Background of the USEWOD Data Challenge

USEWOD2012

Knud Möller, Talis Systems Ltd.
@knudmoeller

Möller, K., Hausenblas, M., Cyganiak, R., Grimnes, G., and Handschuh, S. (2010). Learning from linked open data usage: Patterns & metrics. In WebScience 2010, Raleigh, NC, USA.
http://journal.webscience.org/302/

Linked Data

Conventional “Eye-ball” Web:
• interlinked documents
• mainly people / Web browsers

Web of Linked Data:
• interlinked items of data (URIs, RDF)
• mainly machine agents (but also people?)

How is Linked Data being used?
• plenty of research on conventional Web usage
• what about usage of linked data?

Why?
• how healthy is the Web of linked data?
• who is using the data and how? Is it useful? Are there trends?
• providers: improve hosting
• ... just curiosity!

Approach

particular sites:
– a URI for each data item ➙ a request for each data item (resource)
– content negotiation best practices
– redirection (HTTP 303)

Example:
– plain resource URI: http://data.semanticweb.org/conference/www/2009
– RDF document URI: http://data.semanticweb.org/conference/www/2009/rdf
– HTML document URI: http://data.semanticweb.org/conference/www/2009/html
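
The redirection step above can be sketched roughly as follows. This is a minimal illustration, not the actual server logic of data.semanticweb.org: the function names, the list of RDF media types, and the HTML fallback are assumptions; only the /rdf and /html URI pattern is taken from the slide.

```python
# Hypothetical sketch of 303-based content negotiation.
# Assumption: document URIs are formed by appending /rdf or /html.
RDF_TYPES = ("application/rdf+xml", "text/turtle", "text/n3")
HTML_TYPES = ("text/html", "application/xhtml+xml")


def parse_accept(header):
    """Return media types from an Accept header, highest q-value first."""
    entries = []
    for position, part in enumerate(header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if q > 0:
            entries.append((-q, position, media_type))
    return [media_type for _, _, media_type in sorted(entries)]


def redirect_target(resource_uri, accept_header):
    """Return (status, Location) for a request to a plain resource URI."""
    base = resource_uri.rstrip("/")
    for media_type in parse_accept(accept_header):
        if media_type in RDF_TYPES:
            return 303, base + "/rdf"
        if media_type in HTML_TYPES:
            return 303, base + "/html"
    # Assumed default: fall back to the HTML document.
    return 303, base + "/html"
```

A client sending `Accept: application/rdf+xml` for the plain URI would be redirected to the RDF document URI, while a typical browser header would lead to the HTML document URI.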

Approach (ctd.)

• server log files
– common log format (CLF), combined log format

Fields: Request IP, Request Date, Request String, Response Code, Response Size, Referrer, User Agent

80.219.211.147 - - [23/May/2009:09:52:03 +0100] "GET /sparql?query=PREFIX [..] LIMIT+200 HTTP/1.0" 200 64674 "-" "ARC Reader (http://arc.semsol.org/)"

• RDF requests vs. “semantic” requests
– 90.21.243.141 - - [06/Oct/2008:16:07:58 +0100] "GET /organization/vrije-universiteit-amsterdam-the-netherlands HTTP/1.1" 303 7592 "-" "rdflib-2.4.0 (http://rdflib.net/; eikeon@eikeon.com)"
– 90.21.243.141 - - [06/Oct/2008:16:08:02 +0100] "GET /organization/vrije-universiteit-amsterdam-the-netherlands/rdf HTTP/1.1" 200 453586 "-" "rdflib-2.4.0 (http://rdflib.net/; eikeon@eikeon.com)"
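
A log line in the combined log format can be split into the fields listed above with a regular expression. The following is a minimal sketch under the assumption that each entry sits on a single line; the pattern and the function name `parse_line` are illustrative and not taken from the original analysis.

```python
import re
from datetime import datetime

# Combined log format: IP, identity, user, [date], "request",
# status, size, "referrer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Parse one combined-log-format line into a dict, or None if malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    record["date"] = datetime.strptime(record["date"], "%d/%b/%Y:%H:%M:%S %z")
    return record


sample = (
    '80.219.211.147 - - [23/May/2009:09:52:03 +0100] '
    '"GET /sparql?query=PREFIX [..] LIMIT+200 HTTP/1.0" '
    '200 64674 "-" "ARC Reader (http://arc.semsol.org/)"'
)
record = parse_line(sample)
```

The resulting `record` exposes the status code, response size, and user agent as separate values, which is what the per-agent and per-request-type counts build on.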

Source Data

Figure 1: The combined log format

Table 1: Overview of four LOD datasets (values in parentheses: average per day)

Dataset        # triples    # days  total # hits  # plain hits  # RDF hits  # HTML hits  SPARQL
Dog Food       79,175       597     8,427,967     1,923,945     259,031     1,647,205    879,932
                                    (14,117)      (3,223)       (434)       (2,759)      (1,471)
DBpedia        109,750,000  118     87,203,310    22,821,475    7,008,310   22,999,237   20,972,630
                                    (739,011)     (193,402)     (59,392)    (194,909)    (177,734)
DBTune         74,209,000   61      7,467,125     1,952,185     1,135,509   677,904      3,055,493
                                    (122,412)     (32,003)      (18,615)    (11,113)     (50,090)
RKBExplorer    91,501,684   29      529,938       n/a           n/a         n/a          9,327
                                    (18,274)      (n/a)         (n/a)       (n/a)        (322)

[Figure: pie charts of request shares by type for DBpedia, DBTune and Dog Food. Plain: 47.7%, 45%, 41.0%; HTML: 46.5%, 39.9%, 51.1%; RDF: 5.8%, 14.9%, 7.8%; Semantic: 2.8%, 4.2%, 2.5%]
Agents

[Bar chart: hits per agent, http://data.semanticweb.org, 21/07/2008 - 20/06/2009; x-axis: agents (0 to 30), y-axis: hits (0 to 500000); labelled bars include Googlebot, Yahoo! Slurp, msnbot, SindiceFetcher, multicrawler, rdfbot/1.0 and ARC Reader]

ordinary traffic: the usual suspects

Agents: How “semantic” are they?

[Bar chart: semantic hits / total hits per agent (agents with >100 semantic hits); scale 0 to 1]

Agents shown: attributor/1.13.2, triplr, sindicebot, rdflib-2.4.2, Ripple, OL_Virtuoso_RDF_crawler, Morph_Converter_Service, Falconsbot, Speedy, Slug_SW_Crawler, yacybot, hclsreport-crawler, MJ12bot, PycURL, heritrix/1.14.3, SindiceFetcher, heritrix/pom.version, heritrix/2.0.2, multicrawler, SindiceBot, ia_archiver, Zitgist-APlusPlus-Agent, rdflib-2.4.1, Mp3Bot, curl, Zend_Http_Client, Speedy_Spider, nxcrawler, marbles, -, Java, rdflib-2.4.0, (unknown), ARC_Reader, MLBot, Mozilla, Jakarta_HttpClient, Wget, libwww-perl, MSIE, Firefox, Python-urllib, sindice_ontology_fetcher, Googlebot

Demand for LOD?

[Line chart: DBpedia Hits over Time (smoothing factor 0.05); series: plain, html, rdf, semantic; x-axis: 2009-06-20 to 2009-11-07; y-axis: hits, 0 to 300000]

• no increase for semantic requests

Impact of Real-world Events

[Line chart: Irish Lisbon Treaty Referendum (smoothing factor 0.05); series: http://dbpedia.org/resource/Republic_of_Ireland, http://dbpedia.org/resource/European_Union, http://dbpedia.org/resource/Treaty_of_Lisbon; x-axis: 2009-06-20 to 2009-11-07; y-axis: 0 to 9; annotation: possible impact]

Impact of Real-world Events

[Line chart: Michael Jackson Memorial Service (smoothing factor 0.05); series: http://dbpedia.org/resource/Staples_Center, http://dbpedia.org/resource/Michael_Jackson_memorial_service, http://dbpedia.org/resource/Michael_Jackson; x-axis: 2009-06-20 to 2009-11-07; y-axis: 0 to 4.5; annotation: possible impact]

• this research: one motivation for USEWOD
• expand the dataset, encourage more and different analyses

USEWOD Data Challenge 2012

2nd International Workshop on Usage Analysis
and the Web of Data

Sponsored by the LATC project


Moving forward by releasing a dataset:
• to relieve difficulty of obtaining real-life usage data
• to allow comparison and combination of approaches done on the same dataset
• by releasing a relatively new type of logs: usage on the Web of Data.
Also for longer-term use.

The USEWOD Dataset 2011

Server logs from two major Web of Data servers:
• DBpedia
– several weeks of requests within a 2-month period
• Semantic Web Dog Food
– 2 years of requests, Dec 2008 – Dec 2010

USEWOD 2011 Challenge Participants

• At the time of the workshop, 11 groups had requested the 2011 data
• By now: 28
• 7 data challenge paper submissions
• Winner of the 2011 USEWOD data challenge prize:
– Mario Arias Gallego, Javier D. Fernández, Miguel A. Martínez-Prieto and Pablo De La Fuente. An Empirical Study of Real-World SPARQL Queries.

The USEWOD Dataset 2012

Server logs from four Web of Data servers:
• DBpedia
– several weeks of requests within a 2-month period
• Semantic Web Dog Food
– 2 years of requests, Dec 2008 – Dec 2010
• Linked Open Geo Data
• Bio2RDF

Participants

• 20 groups requested the data, so far.
• 2 data challenge paper submissions…
• 1 winner of the USEWOD data
challenge prize.
• kindly brought to you by LATC

Semantic Web Dog Food

[Screenshots and image taken from http://data.semanticweb.org/]

Requests for Human / Machine readable Web data

Both servers get requests for RDF
• http://dbpedia.org/data/Berlin.rdf
as well as HTML
• http://dbpedia.org/page/Berlin
And requests for the URI itself:
• http://dbpedia.org/resource/Berlin (will be served HTML or RDF)
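
The three kinds of request can be told apart from the request path alone. The sketch below assumes the DBpedia path layout shown in the examples (/data/, /page/, /resource/) plus a /sparql endpoint path; the function name `classify` and the category labels are illustrative choices.

```python
from collections import Counter
from urllib.parse import urlsplit


def classify(target):
    """Classify a request target by the DBpedia-style path prefix."""
    path = urlsplit(target).path
    if path.startswith("/sparql"):
        return "sparql"
    if path.startswith("/data/"):
        return "rdf"    # RDF document
    if path.startswith("/page/"):
        return "html"   # HTML document
    if path.startswith("/resource/"):
        return "plain"  # the resource URI itself
    return "other"


targets = [
    "/data/Berlin.rdf",
    "/page/Berlin",
    "/resource/Berlin",
    "/resource/Michael_Jackson",
    "/sparql?query=SELECT+%3Fs+WHERE+%7B%3Fs+%3Fp+%3Fo%7D",
]
counts = Counter(classify(target) for target in targets)
```

Counting the labels over a whole log gives per-type hit totals of the kind reported for the datasets. Other servers use different conventions (for example, the /rdf and /html suffixes on data.semanticweb.org), so the prefixes would need adjusting per site.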

Requests to SPARQL endpoints

• Both servers have a SPARQL endpoint
to request RDF data:
SELECT DISTINCT ?s ?t ?y ?to ?h
WHERE {
?s dc:title ?t .
?s swrc:year ?y .
OPTIONAL {?s foaf:homepage ?h }.
OPTIONAL {?s foaf:topic ?t }
}
ORDER BY DESC(?y)
LIMIT 200
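
In the logs, such a query appears URL-encoded inside the request string. A minimal sketch for recovering the query text follows; it assumes the query is passed in a `query` parameter of a GET request to a /sparql path, as in the log sample earlier, and the function name `extract_sparql` is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def extract_sparql(request_string):
    """Return the decoded SPARQL query from a logged request string, or None."""
    _method, _, rest = request_string.partition(" ")
    # The protocol is the last token; everything before it is the target.
    target, _, protocol = rest.rpartition(" ")
    if not target:
        target = protocol  # request string without a protocol token
    parts = urlsplit(target)
    if not parts.path.startswith("/sparql"):
        return None
    values = parse_qs(parts.query).get("query")
    return values[0] if values else None


request = "GET /sparql?query=SELECT+%3Fs+WHERE+%7B%3Fs+%3Fp+%3Fo%7D+LIMIT+200 HTTP/1.0"
query = extract_sparql(request)
```

`parse_qs` decodes both percent-escapes and `+` as spaces, so `query` holds readable SPARQL text that can then be passed to a parser or analysed for patterns.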

Anonymizing the USEWOD Dataset

• IP addresses:
– replace all IPs with 0.0.0.0
– add the country code for the original IP address -> track location of requests
– add an identifier of the original IP -> track individual requestors
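
The three steps above can be sketched as follows. How the identifier and the country code were actually produced for the USEWOD dataset is not stated here, so both are assumptions: the identifier is a truncated salted SHA-256 hash, and `country_of` is a hypothetical lookup function supplied by the caller (for instance, backed by a GeoIP database).

```python
import hashlib


def anonymize(ip, salt, country_of):
    """Replace an IP with 0.0.0.0, keeping a country code and a stable identifier.

    Assumptions: `salt` is a secret string kept by the data provider, and
    `country_of` is a caller-supplied function mapping an IP to a country code.
    """
    digest = hashlib.sha256((salt + ip).encode("utf-8")).hexdigest()
    return {
        "ip": "0.0.0.0",
        "country": country_of(ip),
        "requester_id": digest[:16],  # same IP and salt give the same identifier
    }


def demo_country_of(ip):
    """Toy lookup for illustration only; not real geolocation data."""
    return {"80.219.211.147": "CH"}.get(ip, "??")


first = anonymize("80.219.211.147", "secret-salt", demo_country_of)
again = anonymize("80.219.211.147", "secret-salt", demo_country_of)
other = anonymize("90.21.243.141", "secret-salt", demo_country_of)
```

Because the identifier is stable per IP, requests from one requestor can still be grouped, while the address itself no longer appears in the released logs.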

USEWOD2011, Hyderabad, India

• M. Kirchberg, R. K. L. Ko, and B. S. Lee. From linked data to relevant data - time is the essence. - http://arxiv.org/abs/1103.5046
• M. A. Gallego, J. D. Fernández, M. A. Martínez-Prieto, and P. D. L. Fuente. An empirical study of real-world SPARQL queries. - http://arxiv.org/abs/1103.5043

USEWOD2012, Lyon, France

• A. Raghuveer. Characterizing Machine Agent Behavior through SPARQL Query Mining. - http://ir.ii.uam.es/usewod2012/usewod2012_raghuveer.pdf
• J. Hoxha, M. Junghans, S. Agarwal. Enabling Semantic Analysis of User Browsing Patterns in the Web of Data. - http://arxiv.org/abs/1204.2713

What could be improved?
• does not work well with embedded metadata (e.g.,
RDFa-based sites)

• does not take into account usage through meta sites
(indexes, search engines, mirrors, ...)

• does (probably) not take into account usage through
apps

• what about caches?

• what about bulk/dump downloads of data?

• not enough usage to be interesting yet?
