Steffen Staab
staab@uni-koblenz.de
WeST
Web Science & Technologies
University of Koblenz ▪ Landau, Germany
The Challenges of Building
Interoperable Web Observatories
http://wow.west.webobservatory.org/
Challenges – Data Collection Issues
Crawling
What does it mean to crawl a heavily interactive site?
Incomplete data
Where to start?
• We cannot observe everything!
– If only because of the data size!
– Which starting points appear to be the most fruitful?
Where to stop?
• Each crawl is a view
– Twitter
» Tweet
» URL
» Web Page
» Subweb
» Followers
» Followers’ Followers
» ...
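The idea that each crawl is a view can be sketched in a few lines of Python. This is a minimal illustration, not the crawler used in the talk: the toy link graph, the node names and the function bounded_crawl are all hypothetical, and no real Twitter API is called. The seeds answer "where to start?" and max_depth answers "where to stop?"; changing either one yields a different view of the same underlying graph.

```python
from collections import deque


def bounded_crawl(graph, seeds, max_depth):
    """Breadth-first traversal that stops max_depth hops from the seeds.

    Returns a dict mapping each reached node to its distance from a seed.
    """
    seen = {}
    queue = deque()
    for seed in seeds:
        if seed not in seen:
            seen[seed] = 0
            queue.append(seed)
    while queue:
        node = queue.popleft()
        depth = seen[node]
        if depth == max_depth:
            continue  # stopping rule: do not expand beyond max_depth
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen[neighbour] = depth + 1
                queue.append(neighbour)
    return seen


# Hypothetical toy graph: a tweet links to a URL and to its author,
# the URL to a web page, the page to a subweb, the author to followers.
TOY_GRAPH = {
    "tweet:1": ["url:a", "user:alice"],
    "url:a": ["page:a"],
    "page:a": ["page:b", "page:c"],
    "user:alice": ["user:bob"],
    "user:bob": ["user:carol"],
}

view_shallow = bounded_crawl(TOY_GRAPH, ["tweet:1"], max_depth=1)
view_deep = bounded_crawl(TOY_GRAPH, ["tweet:1"], max_depth=3)
```

Here view_shallow contains only the tweet, its URL and its author, while view_deep also reaches the subweb and the followers' followers. Both are incomplete, and neither is the "true" data set.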
Synchronous vs asynchronous
• Strictly speaking, only asynchronous crawling is possible
– But in [Dellschaft & Staab] we targeted the construction of models for streams of tags
Challenges – Data Publishing Issues
Examples of Legal and/or Ethical Issues
AOL query log
Netflix challenge
Delicious
http://www.tagora-project.eu/data/
Twitter
Collecting, but no sharing
• SocialSensor project
Technical/Modelling issues
Generic format, e.g. RDF
Format ready for digestion by specific software, e.g. for Matlab processing
Openness to other data
E.g. references to DBpedia/Wikipedia
Accuracy of publishing
http://me.org showed “...”
http://me.org showed “...”@2013-05-01:0900CEST
http://me.org showed “...”@2013-05-01:0900CEST called from IP 193.99.144.85 using browser...version...history...
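The three levels of publishing accuracy above can be read as one record with progressively more provenance attached. The following Python sketch shows one possible way to model this. The class Observation, its field names and the user-agent string are assumptions made for illustration; the URL, timestamp and IP address are the ones from the slide, and the observed content stays elided as "...".

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """One published observation; optional fields add provenance."""
    url: str
    content: str
    observed_at: Optional[datetime] = None   # when the page was seen
    client_ip: Optional[str] = None          # from where it was requested
    user_agent: Optional[str] = None         # with which browser

    def accuracy_level(self) -> int:
        """1 = content only, 2 = plus timestamp, 3 = plus client context."""
        if self.observed_at is None:
            return 1
        if self.client_ip is None and self.user_agent is None:
            return 2
        return 3

    def to_record(self) -> dict:
        """Serialisable dict that omits unknown fields."""
        record = {k: v for k, v in asdict(self).items() if v is not None}
        if self.observed_at is not None:
            record["observed_at"] = self.observed_at.isoformat()
        return record


# CEST is UTC+2; the timestamp is the one shown on the slide.
CEST = timezone(timedelta(hours=2), "CEST")
when = datetime(2013, 5, 1, 9, 0, tzinfo=CEST)

plain = Observation("http://me.org", "...")
timed = Observation("http://me.org", "...", observed_at=when)
full = Observation("http://me.org", "...", observed_at=when,
                   client_ip="193.99.144.85",
                   user_agent="ExampleBrowser/1.0")  # hypothetical value
```

Making the extra fields optional keeps the least precise form publishable, while a consumer can still check accuracy_level() before relying on a record.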
Sharing Software
Software
For crawling or usage logging
Rather than sharing the data, share the code for observing
Example:
code for crawling Twitter in a certain way
Issues
Limited repeatability
Disturbance liability (“Störerhaftung”) – at least in Germany
• If you provide source code for crawling, e.g., Facebook, Facebook can sue you even if you do not crawl Facebook yourself
Ongoing discussion
What to do about sharing Web Science datasets?
Let’s do simple things first
Collect pointers!
Publish whatever you can publish – others will reuse
Make it more archival
In a way that makes it easy to expand to handle more complex issues
Semantic Wiki!
Web Observatory Wiki
• Main Goals:
• Registry of Web Science datasets
• Compiled by Web Observatory participants – YOU!
• Minor Goals:
• Semantically store all information about datasets
• Make it
• Explorable
• Queryable
• Reusable
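What "semantically stored" and "queryable" could mean for such a registry is sketched below with plain Python triples. This is a stand-in for illustration only and not the wiki's actual data model or query interface: the dataset identifiers, the property names and the "shared" values are hypothetical, and only the Delicious data URL is taken from the slides.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, property, value)


def match(triples: List[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return all triples that agree with every position that is not None."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Hypothetical registry entries, one statement per line.
REGISTRY: List[Triple] = [
    ("dataset:delicious-tags", "source", "Delicious"),
    ("dataset:delicious-tags", "homepage", "http://www.tagora-project.eu/data/"),
    ("dataset:delicious-tags", "shared", "yes"),
    ("dataset:twitter-crawl", "source", "Twitter"),
    ("dataset:twitter-crawl", "shared", "no"),
]

# Query: which datasets are marked as shared?
shared = [subject for subject, _, _ in match(REGISTRY, p="shared", o="yes")]

# Explore: everything the registry knows about one dataset.
about_twitter = match(REGISTRY, s="dataset:twitter-crawl")
```

Storing each fact as a separate statement means that new properties, for example licence or crawl period, can be added later without changing the existing entries.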
Discussion & Q&A
Access to wiki
Current model:
• Edits allowed from anonymous IP addresses and from registered users
• Anyone can be blocked, including IP addresses
Contribute:
Content
Modeling requirements
...
Let us know!
Sanity Check
Understanding
Collecting (to some extent: commodity service)
Describing (WOW)
Analyzing
Modeling
Predicting
Repeating!
So far ad hoc – needs much more:
• Experience
• Guidelines
• Processing workflow
• Executable code shares (on big data!)
• ...