The WebART project developed tools to facilitate scholarly use of web archives. It created an initial search interface called WebARTist to explore a pilot dataset of 432 crawls from the Dutch National Library web archive. The interface allowed full-text search and basic analysis like word frequency, co-word analysis, and geomapping. A workshop with researchers evaluated the interface and provided feedback on improving data quality, search capabilities, and user experience to better meet researcher needs. Next steps include a new prototype with more advanced features and a formal evaluation of the pilot project.
WebART: Facilitating Scholarly Use of Web Archives (IIPC, Apr. 2013)
1. WebART project
Web Archive Retrieval Tools
Jaap Kamps, Richard Rogers, Arjen de Vries
Paul Doorenbosch, René Voorburg, Victor-Jan Vos
Anat Ben-David, Hugo Huurdeman, Thaer Sammar
Flickr: Luc Viatour
IIPC symposium "Scholarly Access to Web Archives", Ljubljana, April 25, 2013
6. WebART Goals
• Evaluating current curation and selection procedures of Web archives
• Getting insights into current use of Web archives
• Developing new methods and tools for research using Web archives
34. Use case analysis (1)
• DMI Winter School
• Analysis types performed:
  • Word frequency count, outlink frequency count
  • (Visual) co-word analysis
  • Geomapping
  • "Temporal analysis"
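The word-frequency and outlink counts listed above can be computed per archived HTML page with a small stdlib-only sketch (the `PageStats` class and function name are hypothetical, not part of the WebART tooling):

```python
import re
from collections import Counter
from html.parser import HTMLParser

class PageStats(HTMLParser):
    """Collects visible text and outgoing links from one archived HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []       # href values of <a> tags (outlinks)
        self.text_parts = []  # visible text fragments
        self._skip = 0        # depth inside <script>/<style> blocks

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text_parts.append(data)

def word_and_outlink_counts(html):
    """Return (word frequency, outlink frequency) Counters for one page."""
    parser = PageStats()
    parser.feed(html)
    words = re.findall(r"[a-z]+", " ".join(parser.text_parts).lower())
    return Counter(words), Counter(parser.links)
```

Summing these Counters over all pages in a crawl gives the corpus-level frequencies used in the DMI analyses; co-word analysis would additionally track which words co-occur within a page.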
35. Use case analysis (2)
Analysis / visualization: DMI Dorling Map Tool, Gephi, Google Fusion Tables, Google Refine, TimelineJS
Data processing: Excel, Google Spreadsheets
36. Use case analysis (3)
• Basic WebARTist usage statistics
[Bar chart: percentage of queries using the date, site, and collection filters (0–30% scale)]
37. Use case conclusions (1)
• Data quality and quantity
  • Limited dataset, but many analysis types possible (daily news crawls)
  • Not always clear what's in and what's out:
    • crawl settings (e.g. depth), temporal gaps
  • Data expansion opportunity:
    • combining datasets (but ...), e.g. KB, CommonCrawl & IA
  • Key concerns: completeness, inconsistencies
38. Use case conclusions (2)
• Search system
  • Influence of retrieval algorithms & indexing settings
  • Recall & precision: precision issues
  • Feature request: duplicate handling
• Interface
  • How to convey uncertainty?
  • How to convey advanced technical features? (e.g. advanced query mechanisms)
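The duplicate-handling feature request above could be addressed by collapsing captures with identical content. A minimal stdlib-only sketch, assuming captures arrive as timestamp-sorted `(timestamp, url, content_bytes)` tuples (a simplifying assumption; `dedupe_captures` is a hypothetical name, not a WebARTist function):

```python
import hashlib

def dedupe_captures(captures):
    """Keep only the first capture whose page content has not been seen
    before, using a SHA-256 digest of the raw bytes as the identity key.
    Assumes `captures` is sorted by timestamp."""
    seen = set()
    unique = []
    for ts, url, content in captures:
        digest = hashlib.sha256(content).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((ts, url, content))
    return unique
```

Exact-hash deduplication only removes byte-identical snapshots; near-duplicate detection (e.g. shingling or simhash) would be needed for pages that differ only in boilerplate.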
39. Use case conclusions (3)
• Users
  • High demand for export functions (formats)
  • (Un)familiarity with temporal (archive) search
  • Trying to utilize "current Web" tools (e.g. link analysis), not applicable to the "past Web"
  • "Users search as in (regular) Web search engines" (see also [Costa & Silva '11])
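What makes temporal archive search unfamiliar is that each URL has many timestamped captures, so results must be restricted by capture date rather than treated as a single "current" page. A minimal sketch of such a date-range filter (the function and data shape are illustrative assumptions, not WebARTist's implementation):

```python
from datetime import date

def filter_by_capture_date(captures, start, end):
    """Return only captures whose date falls within [start, end].
    `captures` is an iterable of (capture_date, url) pairs."""
    return [(d, u) for d, u in captures if start <= d <= end]
```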
40. Next steps WebART
• New prototype ready (~3 TB): faceted search, thumbnail browsing, site categories & advanced metadata
• Formal evaluation of pilot project:
  • Web archive critique
  • Search system
• Research scenarios & use cases