Panel:

Web Archiving –
Lessons and Potential
Abbie Grotke (Library of Congress)
Barbara Signori (Swiss National Library)
Clément Oury (Bibliothèque nationale de France)
Daniel Gomes (Portuguese Web Archive)
Mário J. Silva (INESC-ID)
Nuno Freire (European Library)
Lessons learned
The Portuguese Web Archive
project started in 2008
It provides version history like the
Internet Archive Wayback Machine
But also full-text search over 1.2 billion
web files archived since 1996
Acquiring web data
We needed to integrate third-party
collections archived before 2007
• An archive must have
“old stuff”
• Integration of
historical collections
– 1.9 TB from the
Internet Archive
between 1996 and
2007
– 600 MB CD ROM with
sites published in 1996
Tools to convert saved web files to ARC
format

• “Dead” archived collections became searchable
and accessible
• Specific conversion tools per collection were
required but baseline software could be reused
Oldest Library of Congress site
(October 1996)

• The integration effort was worthwhile: it saved scarce
but valuable information
Crawling the live-web since 2007

• Quarterly broad crawls: 78 million files per crawl
• Daily selective crawls: 764 000 files per day
• Heritrix 1.14.3 initially configured based on previous
experience crawling the Portuguese Web
– Trial-and-error process until the final configuration
• Must recheck configurations periodically
The URLs of the publications crawled
daily change frequently

• The Expresso newspaper has had 5 different domains since
2008
• Seed list of daily crawls must be periodically validated
by humans
The default robots.txt of Content Management
Systems forbids crawling images

• Developers of popular Content Management
Systems are not aware of web archiving
– Joomla's default robots.txt has forbidden images since 2007
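A minimal sketch of how such a default rule affects a crawler that respects the Robots Exclusion Protocol. The robots.txt content, the user-agent name and the URLs below are illustrative assumptions, not a verbatim copy of any CMS's file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt in the style of a CMS default that blocks the image folder.
ROBOTS_TXT = """\
User-agent: *
Disallow: /administrator/
Disallow: /images/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleArchiveBot" is a made-up crawler name.
    for url in ("https://example.org/index.html",
                "https://example.org/images/logo.png"):
        print(url, allowed(ROBOTS_TXT, "ExampleArchiveBot", url))
```

With rules like these, the page itself can be archived but its images cannot, so the archived version would render incomplete.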
Attempt to raise awareness
• Contacted webmasters of the selected
publications by email
– Only 10% replied

• None raised any objection, only questions.
• Some did not know they had robots exclusion
rules on their sites.
• Most did not know what a “web archive” was.
• All were pleased to be selected as
representatives of our cultural heritage.
• The DeDuplicator downloads content, computes its checksum and
compares it with the version from the previous crawl
– Unchanged -> discarded
– Changed -> stored

• No impact on download rate
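The checksum comparison above can be sketched as follows. This is an illustration under stated assumptions, not the DeDuplicator's actual code: the SHA-1 digest, the function name `deduplicate_crawl` and the in-memory dictionaries are all hypothetical choices.

```python
import hashlib


def deduplicate_crawl(fetched, previous_digests):
    """Keep only content whose checksum differs from the previous crawl.

    fetched: dict mapping URL -> downloaded bytes for the current crawl.
    previous_digests: dict mapping URL -> hex digest from the previous crawl.
    Returns (stored, current_digests).
    """
    stored = {}
    current_digests = {}
    for url, content in fetched.items():
        digest = hashlib.sha1(content).hexdigest()  # digest choice is an assumption
        current_digests[url] = digest
        if previous_digests.get(url) != digest:
            stored[url] = content  # changed or new -> stored
        # unchanged -> discarded (only the digest is kept for the next crawl)
    return stored, current_digests


if __name__ == "__main__":
    first, digests = deduplicate_crawl(
        {"http://example.org/": b"v1", "http://example.org/a": b"same"}, {})
    second, _ = deduplicate_crawl(
        {"http://example.org/": b"v2", "http://example.org/a": b"same"}, digests)
    print(sorted(first), sorted(second))
```

Every file is still downloaded in full, which is consistent with the slide's point that the download rate is unaffected; only storage shrinks.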
Savings on quarterly crawls
[Bar chart: average disk space per quarterly crawl (TB), NoDedup vs. DeDup]
41% less disk space to store content
Savings on daily crawls
[Bar chart: average disk space per daily crawl (GB), NoDedup vs. DeDup]
76% less disk space to store content
Total savings from using DeDuplicator

26.5 TB/year
• Using DeDuplicator saved space without
performance degradation.
Ranking the past Web
NutchWAX as baseline for
full-text search
Users were not satisfied with
NutchWAX search
• Unpolished
interface
• Slow results
– 40M URLs, >20s

• Low relevance
for search results
Developed a new web archive
search system
• Quicker response times
• Improved relevance of search results
Had to build a Web Archive Information
Retrieval Test Collection: PWA9609
• To evaluate and improve relevance for search
results
• Corpus of documents from 1996 to 2009
– 255 million web pages (8.9 TB)
– 6 collections: Internet Archive, PWA broad crawls,
integrated collections

• Gold collection
– Query, relevant results
Time-aware ranking models yield
better search results

Metric         Time-unaware   Time-aware ranking models (our proposals)
               NutchWAX       TVersions   TSpan    MdRankBoost (L2R)
nDCG@1         0.250          0.430       0.450    0.550
nDCG@10        0.174          0.202       0.193    0.555
Precision@1    0.320          0.500       0.520    0.600
Precision@10   0.168          0.172       0.158    0.194

More details: Miguel Costa, Mário J. Silva, Evaluating Web
Archive Search Systems, WISE’2012
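For readers unfamiliar with the metrics in the table, here is a minimal sketch of Precision@k and nDCG@k in one common formulation (linear gain, log2 discount, ideal ranking taken from the same judged list). The slides do not state which exact variant the cited paper uses, so treat the details as assumptions; the function names are illustrative.

```python
import math


def precision_at_k(relevances, k):
    """Fraction of the top-k results that are relevant (relevance > 0)."""
    return sum(1 for rel in relevances[:k] if rel > 0) / k


def dcg_at_k(relevances, k):
    """Discounted cumulative gain: rel_i / log2(i + 1) summed over ranks 1..k."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG normalised by the DCG of the best possible ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Graded relevance judgements of one ranked result list (made-up values).
    ranking = [0, 2, 1, 0, 1]
    print(precision_at_k(ranking, 1), precision_at_k(ranking, 5))
    print(ndcg_at_k(ranking, 1), ndcg_at_k(ranking, 5))
```

Scores in the table would then be averages of such per-query values over the queries of the test collection.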
Designing user interface
NutchWAX (2007) vs. PWA (2012)

• Internationalization support
• New graphical design
• Advanced search user interface
• 71% overall user satisfaction from rounds of usability testing
Observations from usability testing
Searching the past web is a confusing concept

• Understanding web archiving requires a technical background
• Must provide examples of web-archived pages
Users are addicted to query
suggestions

• Developed query suggestions mechanism
for web archive search
Users “google” the past and we have
to comply
• Users search web archives replicating their
behavior from live-web search engines
• Users type queries into the first input box that
they find
– Search system must identify query type (URL or
full-text) and present corresponding results

• Must provide additional tutorials and
contextual help to search the past web
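A minimal sketch of the query-type identification mentioned above, assuming a simple heuristic: a single token whose host part looks like a domain name is treated as a URL, anything else as full-text. The function name and the heuristic are illustrative assumptions, not the PWA implementation.

```python
import re
from urllib.parse import urlsplit

# A host such as "www.expresso.pt": dot-separated labels ending in a letters-only TLD.
_HOST_RE = re.compile(
    r"^(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}$", re.IGNORECASE)


def query_type(query: str) -> str:
    """Classify a search-box input as "url" or "full-text" (heuristic sketch)."""
    q = query.strip()
    if not q or any(ch.isspace() for ch in q):
        return "full-text"  # empty input or several words
    candidate = q if "://" in q else "http://" + q
    try:
        host = urlsplit(candidate).hostname or ""
    except ValueError:
        return "full-text"  # malformed input, fall back to full-text search
    return "url" if _HOST_RE.match(host) else "full-text"


if __name__ == "__main__":
    for q in ("www.expresso.pt", "http://www.loc.gov/index.html",
              "portuguese elections 2009", "expresso"):
        print(repr(q), "->", query_type(q))
```

A heuristic like this misclassifies some inputs (a bare file name such as "report.pdf" looks like a host), so a real system would likely need extra checks.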
Hardware
Blade Systems/Storage Area Networks
vs. Independent servers
• 61 computers, 1.8 TB RAM,
340 disks (370 TB)
• Blade systems and SAN are
not adequate for web
archiving
– Extremely expensive
– Single points of failure
– Hard to manage

• Independent servers are
cheaper and more reliable
Legal issues
Just concerns
• Respect the Robots Exclusion Protocol
• 1-year embargo
• Proactively remove illicit content
• Remove content at the request of
authors
Potential as research
infrastructure
API to process archived data using the
PWA Hadoop cluster
Measure web accessibility for people
with disabilities

In Rui Lopes, Daniel Gomes, Luís Carriço, Web Not For All: A Large Scale
Study of Web Accessibility, 2010
Characterizations of the Portuguese
Web structure

Media type    % contents 2005   % contents 2008   Trend
Text/html     61.2%             57.8%             -5.5%
Image/jpeg    22.6%             22.8%             +1.2%
Image/gif     11.4%             9.4%              -17.4%
Text/pdf      1.6%              1.9%              +18.5%
Other         3.2%              8.1%              -

In João Miranda, Daniel Gomes, Trends in Web characteristics, 2009.
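The Trend column reads as the relative change of each share between 2005 and 2008. A small sketch of that arithmetic, assuming trend = (share 2008 - share 2005) / share 2005; recomputing from the rounded shares shown on the slide gives values close to, but not identical with, the published trends, presumably because the originals were computed from unrounded shares.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Shares (% of contents) as printed on the slide: (2005, 2008).
SHARES = {
    "Text/html": (61.2, 57.8),
    "Image/jpeg": (22.6, 22.8),
    "Image/gif": (11.4, 9.4),
    "Text/pdf": (1.6, 1.9),
}

if __name__ == "__main__":
    for media_type, (y2005, y2008) in SHARES.items():
        print(f"{media_type}: {relative_change(y2005, y2008):+.1f}%")
```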
Archiving Web Spam degrades
search results

The first search result is Web Spam.
But archiving Web Spam is not useless
for research:
Improve Web Spam detectors!

In A. Garzó et al., Cross-Lingual Web Spam
Classification, 2013
OpenSearch to extend functionality
Web archive search can be easily
integrated into web browsers
OpenSearch used by Computer Science
students to create new web applications

• A web application combines information about politicians
from several sources: Wikipedia, YouTube, Twitter,
Portuguese Web Archive
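OpenSearch description documents publish a URL template with placeholders such as {searchTerms}; a client fills the template to build a query. A minimal sketch of that substitution follows. The endpoint `archive.example` and its parameter names are hypothetical, not the PWA's real service address.

```python
import re
from urllib.parse import quote

# Hypothetical template; "{startIndex?}" marks an optional parameter.
TEMPLATE = "https://archive.example/opensearch?query={searchTerms}&start={startIndex?}"


def fill_template(template: str, params: dict) -> str:
    """Replace {name} and {name?} placeholders with URL-encoded values."""
    def substitute(match):
        name, optional = match.group(1), match.group(2) == "?"
        if name in params:
            return quote(str(params[name]), safe="")
        if optional:
            return ""  # optional parameters may be left empty
        raise KeyError(f"missing required parameter: {name}")

    return re.sub(r"\{([A-Za-z:]+)(\??)\}", substitute, template)


if __name__ == "__main__":
    print(fill_template(TEMPLATE, {"searchTerms": "expresso 1996"}))
```

A browser search plugin or a student application would perform the same substitution and then parse the feed returned by the service.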
All our source code and test collections
are freely available
Conclusions
• Web archives are crucial infrastructures for
modern societies
• Must raise awareness about web archiving
among users and developers
• We need to collaborate
Panel discussion
1. How is your experience related to this work?
2. How could web archives be further improved?
3. How could web archives interact with libraries/other
cultural heritage organizations?
4. How to unfold the full potential of web archives as
research infrastructures?
5. Which innovative collaborations could be
established?
6. What is the role of web archiving in modern
societies?
7. …
