Challenges of Building Web Observatories

Steffen Staab
staab@uni-koblenz.de
WeST
Vote for free Web Science MOOC!

You want to have more free
Web Science Education on the Web?
Vote for our course at
https://moocfellowship.org/
now!

Web Science & Technologies
University of Koblenz ▪ Landau, Germany
The Challenges of Building
Interoperable Web Observatories
http://wow.west.webobservatory.org/
Steffen Staab

What to observe?
[Diagram. Labels: Produce, Consume; Cognition, Emotion, Behavior, Socialisation, Knowledge; Observable micro-interactions in the Web; Apps, Protocols, Data & Information, Governance; WWW; Observable macro-effects in the Web]

Why to observe?
 Understanding
 Collecting
 Describing
 Analyzing
 Modeling
 Predicting
 Repeating!

What to observe?
[Diagram of observable micro-interactions and macro-effects in the Web, as on the earlier "What to observe?" slide, annotated with two observation methods: Web Crawling and Usage Logging]

Challenges – Data Collection Issues
Legal and/or Ethical
 Crawling
 May be disallowed by provider
 Usage logging
 Privacy of individuals
 Even if it is allowed ...
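
One technical signal of whether a provider disallows crawling is its robots.txt file. The following is a minimal sketch, assuming a hypothetical crawler name (`observatory-bot`) and example.org URLs that are not part of the talk. It only checks the robots.txt rules; it says nothing about terms of service, law, or ethics.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.org/public/page",
                "http://example.org/private/data"):
        print(url, may_crawl(EXAMPLE_ROBOTS, "observatory-bot", url))
```

In a real crawl the robots.txt text would be downloaded from the site first; here it is passed in as a string so the check can be tried offline.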

 Crawling
 What does it mean to crawl a heavily interactive site?
 Incomplete data
• Unreachability
• Time outs
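
Because unreachable pages and time-outs make every crawl incomplete, it can help to record what was missed alongside what was fetched. This is a sketch under assumptions: the names `CrawlReport`, `fetch`, and `crawl` are hypothetical, and the fetcher is passed in as a parameter so the logic can be tried without network access.

```python
import urllib.request
from dataclasses import dataclass, field


@dataclass
class CrawlReport:
    """What a crawl saw, and what it failed to see."""
    fetched: dict = field(default_factory=dict)  # url -> page bytes
    failed: dict = field(default_factory=dict)   # url -> error class name


def fetch(url: str, timeout: float = 5.0) -> bytes:
    """Default fetcher: download a page, giving up after `timeout` seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl(urls, fetcher=fetch) -> CrawlReport:
    """Fetch each URL and record failures instead of silently dropping them."""
    report = CrawlReport()
    for url in urls:
        try:
            report.fetched[url] = fetcher(url)
        except OSError as exc:
            # urllib.error.URLError and TimeoutError are both OSError subclasses,
            # so unreachable hosts and time-outs end up here.
            report.failed[url] = type(exc).__name__
    return report
```

Publishing `report.failed` together with the data makes the gaps in the dataset explicit for anyone who reuses it.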

 Crawling
 Incomplete data
 Where to start?
• We cannot observe everything!
– Even just for data size!
– Which starting points appear to be most fruitful?

 Crawling
 Incomplete data
 Where to start?
 Where to stop?
• Each crawl is a view
– Twitter
» Tweet
» URL
» Web Page
» Subweb
» Followers
» Followers’ followers
» ...
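
The idea that each crawl is a view can be illustrated with a depth-limited breadth-first traversal: the start node and the stopping depth together determine what the resulting dataset contains. This is a sketch on a made-up follower graph; the function name `crawl_view` and the account names are hypothetical, and no real Twitter API is involved.

```python
from collections import deque


def crawl_view(start, neighbours, max_depth):
    """Breadth-first crawl from `start`, stopping at `max_depth` hops.

    `neighbours` is a function mapping a node to the nodes it links to.
    Returns a dict mapping each visited node to its distance from `start`.
    """
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth_of[node] == max_depth:
            continue  # this is where the crawl stops expanding
        for nxt in neighbours(node):
            if nxt not in depth_of:
                depth_of[nxt] = depth_of[node] + 1
                queue.append(nxt)
    return depth_of


# Toy follower graph (invented data, for illustration only).
FOLLOWERS = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": ["frank"],
}

if __name__ == "__main__":
    lookup = lambda user: FOLLOWERS.get(user, [])
    print(crawl_view("alice", lookup, max_depth=1))  # followers only
    print(crawl_view("alice", lookup, max_depth=2))  # followers' followers too
```

Two crawls of the same graph with different depths yield different datasets, which is why the start and stop criteria belong in the dataset description.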

 Crawling
 Incomplete data
 Where to start?
 Where to stop?
 Synchronous vs asynchronous
• Strictly speaking, only asynchronous crawling is possible
– But in [Dellschaft & Staab] we targeted the construction of models for streams of tags

Challenges – Data Publishing Issues
Legal and/or Ethical Example Issues
 AOL query log
 Netflix challenge
 Delicious
 http://www.tagora-project.eu/data/
 Twitter
 Collecting, but no sharing
• SocialSensor project

Challenges – Data Publishing Issues
Technical/Modelling issues
 Generic format, e.g. RDF
 Format ready for consumption by specific software, e.g. for Matlab processing
 Openness to other data
 E.g. references to DBPedia/Wikipedia
 Accuracy of publishing
 http://me.org showed "..."
 http://me.org showed "..." @ 2013-05-01 09:00 CEST
 http://me.org showed "..." @ 2013-05-01 09:00 CEST, called from IP 193.99.144.85 using browser ... version ... history ...
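
The three levels of publishing accuracy above can be sketched as one record type whose optional fields are filled in as more observation context is published. The class name `Observation` and its field names are assumptions made for illustration, not a format proposed in the talk; the elided page content stays as "...".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# CEST is UTC+2; a fixed offset is enough for a single timestamp.
CEST = timezone(timedelta(hours=2), "CEST")


@dataclass(frozen=True)
class Observation:
    """What a URL showed, plus as much observation context as is published."""
    url: str
    content: str
    observed_at: Optional[datetime] = None  # level 2: when it was observed
    client_ip: Optional[str] = None         # level 3: from where it was requested

    def to_record(self) -> dict:
        """Serialise only the fields that are actually known."""
        record = {"url": self.url, "content": self.content}
        if self.observed_at is not None:
            record["observed_at"] = self.observed_at.isoformat()
        if self.client_ip is not None:
            record["client_ip"] = self.client_ip
        return record


if __name__ == "__main__":
    when = datetime(2013, 5, 1, 9, 0, tzinfo=CEST)
    levels = [
        Observation("http://me.org", "..."),
        Observation("http://me.org", "...", when),
        Observation("http://me.org", "...", when, "193.99.144.85"),
    ]
    for obs in levels:
        print(obs.to_record())
```

Writing the timestamp in ISO 8601 form with an explicit UTC offset avoids ambiguity for readers in other time zones. Note that the most precise level also carries the most privacy-sensitive information.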

Sharing Software
 Software
 For crawling or usage logging
 Rather than sharing the data, share the code for observing
 Example:
 code for crawling Twitter in a certain way
 Issues
 Limited repeatability
 Disturbance liability ("Störerhaftung") – at least in Germany
• If you provide source code for crawling, e.g., Facebook, Facebook can sue you even if you do not crawl Facebook yourself

In spite of all this ...
WEB OBSERVATORY WIKI

Ongoing discussion
 What to do about sharing Web Science datasets?
 Let’s do simple things first
 Collect pointers!
 Publish whatever you can publish – others will reuse it
 Make it more archival
 In a way that makes it easy to expand to handle more complex issues
 Semantic Wiki!

Web Observatory Wiki
• Main Goals:
• Registry of Web Science datasets
• Compiled by Web Observatory participants – YOU!
• Minor Goals:
• Semantically store all information about datasets
• Make it
• Explorable
• Queryable
• Reusable

Quick Facts - 1
 Semantic MediaWiki + Forms Extension
 URL: http://wow.west.webobservatory.org/
 Main classes, each with an example:
• Dataset_Repository: KONECT
• Dataset: Slashdot Zoo
• Organization: WeST

Quick Facts - 2
 Semantic MediaWiki + Forms Extension
 URL: http://wow.west.webobservatory.org/
 Class hierarchy example, each class with its attributes:
• Dataset: Dublin Core + size, license, URL, …
• Network: node count
• Social Network: …

Semantic Exploration by Views

Semantic Forms: Providing Data

Schema (Excerpt)
[Figure: RDF graph. Resources: ko:konect, ko:slashdot-zoo, ko:twitter. Classes: wow:dataset-repository, wow:dataset, wow:network, wow:social-network. Properties: wow:contains, wow:size, wow:network-volume. Edge labels: rdf:type, rdfs:subClassOf, rdfs:domain, rdfs:range. Literal values: 1944, 120000000]
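
A class hierarchy of this kind lets a query for all datasets also return networks and social networks. The following is a minimal sketch using plain tuples as triples rather than an RDF library. The subclass chain follows the Dataset, Network, Social Network hierarchy from the Quick Facts slide; the typing of ko:slashdot-zoo as a social network and the wow:contains edge are assumptions, since the exact edges of the figure are not recoverable from the text.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

TRIPLES = {
    ("wow:social-network", SUBCLASS_OF, "wow:network"),
    ("wow:network", SUBCLASS_OF, "wow:dataset"),
    ("ko:konect", RDF_TYPE, "wow:dataset-repository"),
    # Assumed edges: not confirmed by the slide text.
    ("ko:slashdot-zoo", RDF_TYPE, "wow:social-network"),
    ("ko:konect", "wow:contains", "ko:slashdot-zoo"),
}


def superclasses(cls, triples):
    """All classes reachable from `cls` by following rdfs:subClassOf."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in triples:
            if subject == current and predicate == SUBCLASS_OF and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


def types_of(resource, triples):
    """Declared types of `resource` plus every type implied by the hierarchy."""
    direct = {o for s, p, o in triples if s == resource and p == RDF_TYPE}
    inferred = set(direct)
    for cls in direct:
        inferred |= superclasses(cls, triples)
    return inferred


if __name__ == "__main__":
    print(sorted(types_of("ko:slashdot-zoo", TRIPLES)))
```

Under these assumptions, a resource typed only as a social network is also found by a query for networks or for datasets.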

Discussion & Q&A
 Access to wiki
 Current model:
• Edits allowed by IPs and users
• Everyone can be blocked, including IPs
 Contribute:
 Content
 Modeling requirements
 ...
 Let us know!

Sanity Check
 Understanding
 Collecting (to some extent: commodity service)
 Describing (WOW)
 Analyzing
 Modeling
 Predicting
 Repeating!
So far ad hoc – needs much more:
• Experience
• Guidelines
• Processing workflow
• Executable code shares (on big data!)
• ...

 What else do we need?

Vote at: https://moocfellowship.org/
