The WebART project developed tools to facilitate scholarly use of web archives. It created an initial search interface called WebARTist to explore a pilot dataset of 432 crawls from the Dutch National Library web archive. The interface allowed full-text search and basic analysis like word frequency, co-word analysis, and geomapping. A workshop with researchers evaluated the interface and provided feedback on improving data quality, search capabilities, and user experience to better meet researcher needs. Next steps include a new prototype with more advanced features and a formal evaluation of the pilot project.
WebART: Facilitating Scholarly Use of Web Archives (IIPC, Apr. 2013)
1. WebART project
Web Archive Retrieval Tools
Jaap Kamps, Richard Rogers, Arjen de Vries
Paul Doorenbosch, René Voorburg, Victor-Jan Vos
Anat Ben-David, Hugo Huurdeman, Thaer Sammar
Flickr: Luc Viatour
IIPC symposium "Scholarly Access to Web Archives", Ljubljana, April 25, 2013
6. WebART Goals
• Evaluating current curation and selection procedures of Web archives
• Getting insights into current use of Web archives
• Developing new methods and tools for research using Web archives
34. Use case analysis (1)
• DMI Winter School
• Analysis types performed:
  • Word frequency count, outlink frequency count
  • (Visual) co-word analysis
  • Geomapping
  • "Temporal analysis"
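The word-frequency and outlink counts listed above can be computed per archived HTML page with a small stdlib-only sketch (the `PageStats` class and function name are hypothetical, not part of the WebART tooling):

```python
import re
from collections import Counter
from html.parser import HTMLParser

class PageStats(HTMLParser):
    """Collects visible text and outgoing links from one archived HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []       # href values of <a> tags (outlinks)
        self.text_parts = []  # visible text fragments
        self._skip = 0        # depth inside <script>/<style> blocks

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text_parts.append(data)

def word_and_outlink_counts(html):
    """Return (word frequency, outlink frequency) Counters for one page."""
    parser = PageStats()
    parser.feed(html)
    words = re.findall(r"[a-z]+", " ".join(parser.text_parts).lower())
    return Counter(words), Counter(parser.links)
```

Summing these Counters over all pages in a crawl gives the corpus-level frequencies used in the DMI analyses; co-word analysis would additionally track which words co-occur within a page.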
35. Use case analysis (2)
Analysis / visualization: DMI Dorling Map Tool, Gephi, Google Fusion Tables, Google Refine, TimelineJS
Data processing: Excel, Google Spreadsheets
36. Use case analysis (3)
• Basic WebARTist usage statistics
[Bar chart: percentage of queries using the date, site, and collection filters (0–30% scale)]
37. Use case conclusions (1)
• Data quality and quantity
  • Limited dataset, but many analysis types possible (daily news crawls)
  • Not always clear what's in and what's out:
    • crawl settings (e.g. depth), temporal gaps
  • Data expansion opportunity:
    • combining datasets (but ...), e.g. KB, CommonCrawl & IA
  • Key concerns: completeness, inconsistencies
38. Use case conclusions (2)
• Search system
  • Influence of retrieval algorithms & indexing settings
  • Recall & precision: precision issues
  • Feature request: duplicate handling
• Interface
  • How to convey uncertainty?
  • How to convey advanced technical features? (e.g. advanced query mechanisms)
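The duplicate-handling feature request above could be addressed by collapsing captures with identical content. A minimal stdlib-only sketch, assuming captures arrive as timestamp-sorted `(timestamp, url, content_bytes)` tuples (a simplifying assumption; `dedupe_captures` is a hypothetical name, not a WebARTist function):

```python
import hashlib

def dedupe_captures(captures):
    """Keep only the first capture whose page content has not been seen
    before, using a SHA-256 digest of the raw bytes as the identity key.
    Assumes `captures` is sorted by timestamp."""
    seen = set()
    unique = []
    for ts, url, content in captures:
        digest = hashlib.sha256(content).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((ts, url, content))
    return unique
```

Exact-hash deduplication only removes byte-identical snapshots; near-duplicate detection (e.g. shingling or simhash) would be needed for pages that differ only in boilerplate.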
39. Use case conclusions (3)
• Users
  • High demand for export functions (formats)
  • (Un)familiarity with temporal (archive) search
  • Trying to utilize "current Web" tools (e.g. link analysis), not applicable to the "past Web"
  • "Users search as in (regular) Web search engines" (see also [Costa & Silva '11])
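What makes temporal archive search unfamiliar is that each URL has many timestamped captures, so results must be restricted by capture date rather than treated as a single "current" page. A minimal sketch of such a date-range filter (the function and data shape are illustrative assumptions, not WebARTist's implementation):

```python
from datetime import date

def filter_by_capture_date(captures, start, end):
    """Return only captures whose date falls within [start, end].
    `captures` is an iterable of (capture_date, url) pairs."""
    return [(d, u) for d, u in captures if start <= d <= end]
```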
40. Next steps WebART
• New prototype ready (~3 TB): faceted search, thumbnail browsing, site categories & advanced metadata
• Formal evaluation of pilot project:
  • Web archive critique
  • Search system
• Research scenarios & use cases