WEB ARCHIVING
CHALLENGES & OPPORTUNITIES
PRESENTATION FOR WEB ARCHIVING ENGINEERING POSITION
Ahmed AlSum
PhD Candidate
Old Dominion University
Outline
• Engineering Experience
• IBM
• Old Dominion University
• Internet Archive
• Web Archiving Challenges & Opportunities
• Selection
• Harvesting
• Storage
• Access
• Community
• Conclusions
Cairo, Egypt
2006 - 2009
CCSP Project
• An internal IBM support portal that provides client-facing
audiences a by-client, holistic view of client situations
• Technologies: WebSphere Portal, DB2, deployed on
zLinux machines
Responsibilities
• Software Engineer
• Enterprise Applications with J2EE platform technologies for
frontend (Servlets, JSP, Portlet APIs), and backend tasks based on
EJB
• Front-end components based on Web 2.0 technologies (AJAX
based on Dojo 1.0, and JavaScript)
• Lotus Sametime (Plugins and Bot development)
• Software engineer team leader
• Support project quality activities
• Lead code review and static analysis activities
Responsibilities
• Administrator
• Deploying Portal solutions on WebSphere Portal
• WebSphere Portal Administration for standalone and clustered
environment
• Administration on Linux and Windows OS
• DB2 server administration for single instance and multiple
instances with HADR support
• Customer support team lead
• Leading customer support activities
Certifications
Sharing IBM Internal Solutions
with Broader Community
Norfolk, VA USA
2009 - 2013
Memento
• Memento is an HTTP
extension to integrate the
Past and the Current
Web
I. Jacobs and N. Walsh, Architecture of the World Wide Web. Technical report, W3C, 2004. http://www.w3.org/TR/webarch/
Now
T1
T2
T3
Memento
• Developer and administrator for Memento aggregator and proxies
Memento Clients
• Memento is currently an Internet-Draft (I-D) and is expected to
move to RFC status soon.
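To illustrate the HTTP extension, here is a minimal sketch of Memento datetime negotiation: the client states the desired time in an Accept-Datetime request header, and each archived copy carries a Memento-Datetime response header. The helper names and the archive.example URIs are hypothetical, and no network access is involved.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_header(when: datetime) -> dict:
    """Build the Accept-Datetime request header (HTTP date format, GMT)."""
    return {"Accept-Datetime": format_datetime(when.astimezone(timezone.utc), usegmt=True)}


def closest_memento(mementos: dict, when: datetime) -> str:
    """Return the memento URI whose Memento-Datetime value is nearest to `when`.

    `mementos` maps a memento URI to its Memento-Datetime header value.
    """
    return min(
        mementos,
        key=lambda uri: abs(parsedate_to_datetime(mementos[uri]) - when),
    )


# Hypothetical mementos (T1, T2, T3) of one original resource.
SAMPLE = {
    "http://archive.example/2009/": format_datetime(datetime(2009, 6, 1, tzinfo=timezone.utc), usegmt=True),
    "http://archive.example/2010/": format_datetime(datetime(2010, 2, 1, tzinfo=timezone.utc), usegmt=True),
    "http://archive.example/2012/": format_datetime(datetime(2012, 1, 1, tzinfo=timezone.utc), usegmt=True),
}
```

A real client would send the header to a TimeGate and follow the redirect; the selection function above only mimics what the server side might do.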
San Francisco, CA USA
2012
WAT Extraction
• Web Archive Transformation (WAT) is a specification for
structuring metadata generated by Web crawls
• Technologies:
WEB ARCHIVING
Challenges and Opportunities
Web Archive Life Cycle
Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of the 3rd International Conference on Web Science, pp. 1–8.
Selection
• Decide what to capture
Everything, any domain
National domains
Delegate selection to partners
Users’ favorites
• We studied what is already captured
How Much Of The Web Is
Archived?
S. G. Ainsworth, A. AlSum, H. SalahEldeen, M. C.
Weigle, and M. L. Nelson
In Proceedings of the 11th annual international ACM/IEEE
joint conference on Digital libraries, JCDL
'11, Ottawa, Canada 2011
See also: http://arxiv.org/abs/1212.6177
Archive categories
We have 3 categories of archives
• Internet Archive (classic interface)
• Search engine
• Other archives
Selection
UK
US
Public Archives, ca. Late 2010 / Early 2011
1000 URIs Ordered by First Observation Date
Selection
See also: http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html
Memento Distribution, ordered by the first observation date
How Much of the Web is Archived?
It Depends on Which Web…
Selection
Including SE cache    Excluding SE cache
90%                   79%
97%                   68%
88%                   19%
35%                   16%
Changes since 2011: no more free SE APIs;
greatly reduced IA quarantine period; 15 public web archives
2013: 95%, 92%, 23%, 26%
Profiling Web Archive
Coverage For
Top-level Domain And
Content Language
A. AlSum, M. C. Weigle, M. L. Nelson, and H. Van de Sompel
In Proceedings of the 17th International Conference on
Theory and Practice of Digital Libraries, TPDL 2013, 2013
See also: http://arxiv.org/abs/1309.4008
Where is it archived?
Selection
Legend: IA = Internet Archive; CAN = Library and Archives Canada; PO = Portuguese Web Archive; CZ = Archive of the Czech Web;
LoC = Library of Congress; BL = British Library; CAT = Web Archive of Catalonia; TW = National Taiwan University;
IC = Icelandic Web Archive; UK = UK National Library; CR = Croatian Web Archive; AIT = Archive-It
Language Coverage
Selection
Growth Rate
Selection
Borrowed Portuguese
material from IA
Stopped archiving
since 2008
Steady growth
Stopped getting new
URIs, but still crawling
Selection Research Output
• Some portions of the web are
not well archived, such as India
and Africa.
• Profiling helps us with Memento
query routing.
• IIPC proposal with Herbert Van
de Sompel (LANL) and David
Rosenthal (SUL).
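The query-routing idea can be sketched as follows: given a per-archive coverage profile by top-level domain, a Memento aggregator would forward a lookup only to archives likely to hold the URI. The archive names, coverage numbers, and the `min_coverage` threshold below are all invented for illustration; they are not values from the profiling study.

```python
from urllib.parse import urlsplit

# Hypothetical profile: fraction of sampled URIs per TLD found in each archive.
PROFILE = {
    "archive-a": {"com": 0.80, "uk": 0.30, "pt": 0.10},
    "archive-b": {"uk": 0.70},
    "archive-c": {"pt": 0.60, "com": 0.05},
}


def route(uri, profile=PROFILE, min_coverage=0.2):
    """Return archive names worth querying for `uri`, best coverage first."""
    host = urlsplit(uri).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    ranked = sorted(
        (
            (coverage[tld], name)
            for name, coverage in profile.items()
            if coverage.get(tld, 0.0) >= min_coverage
        ),
        reverse=True,
    )
    return [name for _, name in ranked]
```

Archives below the threshold are skipped entirely, which trades a small loss in recall for fewer requests per lookup.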
Selection
Selection at SUL
• Focus on the missing parts of the Web
• Twitter - Crowdsource:
• UK Web archive: Twittervana
• Internet Memory: Collect URIs from twitter APIs
• VA Tech: CTRNET project
• Stanford Community
• World News collection: 10 news websites from each country
• Tools:
Selection
Web Archive Life Cycle
Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of the 3rd International Conference on Web Science, pp. 1–8.
Harvesting
• Services
• Archive-It
• WAS @ CDLib
• Dedicated servers
• New tools
See also: http://ws-dl.blogspot.com/2013/07/2013-07-10-warcreate-and-wail-warc.html
Special Harvesting Techniques
• Borrow old materials from other web archives
• Example: Stanford WebBase Project*
• 260 TB
• 7 Billion webpages
Harvesting
*http://www-diglib.stanford.edu/~testbed/doc2/WebBase/
Special Harvesting Techniques
• Social Media
• Focus on shared resources in the social media
Harvesting
Hany M. SalahEldeen, Michael L. Nelson, Losing My Revolution: How Many Resources Shared on Social Media Have Been
Lost?, Proceedings of TPDL 2012
http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
Special Harvesting Techniques
• SiteStory - Transactional Archive
Harvesting
Justin F. Brunelle, Michael L. Nelson, Lyudmila Balakireva, Robert Sanderson, Herbert Van de Sompel, Evaluating the SiteStory
Transactional Web Archive With the ApacheBench Tool, Proceedings of TPDL 2013
Sitestory: http://mementoweb.github.io/SiteStory/
Harvesting
• Challenges
• Ajax and Web 2.0/3.0
• Streaming Media
• URI challenges
• Mobile
Harvesting
http://blog.dshr.org/2012/05/harvesting-and-preserving-future-web.html
http://netpreserve.org/sites/default/files/resources/OverviewFutureWebWorkshop.pdf
Web Archive Life Cycle
Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of the 3rd International Conference on Web Science, pp. 1–8.
Storage (Format)
• Flat files:
• WARC files (ISO standard)
• No-SQL db:
• HBase at Internet Memory*
• Storage at SUL:
• We need to use both
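For the flat-file side, the following minimal sketch shows the general shape of a WARC record: a version line, named header fields, a blank line, the content block, and two trailing CRLFs. The function name is hypothetical, and the record type `resource` is chosen only to keep the example short; a crawler would usually write `response` records that include the HTTP headers.

```python
import uuid
from datetime import datetime, timezone


def warc_record(target_uri: str, payload: bytes, when: datetime) -> bytes:
    """Serialize one WARC/1.0 'resource' record for `payload`."""
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {when.astimezone(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: text/html",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Each record ends with two CRLFs after the content block.
    return head + payload + b"\r\n\r\n"
```

Records like this are appended one after another into a single file, which is why a separate index (for example in a NoSQL store) is still needed for fast lookup.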
Storage
*Philippe Rigaux, Understanding HBase: The data model, IM technology blog
http://internetmemory.org/en/index.php/synapse/understanding_the_hbase_data_model/
Storage (Infrastructure)
• The wrong solution could be a disaster
Storage
Web Archive Life Cycle
Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of the 3rd International Conference on Web Science, pp. 1–8.
Accessing Web Archive
URI-Based
WayBack Machine
• Textbox to enter the
requested URI
• BubbleMap to show
you the available
mementos
Accessing Web Archive
Full-text search
• Challenges: Temporal
Page Rank, Rank per
site or memento, Date
filtering
Accessing Web Archive
• Thumbnail View
• Trade-off between
building the
thumbnail in real time
or pre-building
• Also, a trade-off
between representing
the thumbnail by URI
or by embedded
binary data
• Can we build a partial
thumbnail map?
Accessing Web Archive
• Title View
• Trade-off between extracting all the titles in advance and keeping them as
metadata about the mementos, and extracting each title from the HTML
content in real time
Implemented using Simile: http://www.simile-widgets.org/timeline/
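The title trade-off can be sketched with a small index that sits between the two extremes: a title is parsed from the HTML the first time it is requested and then kept as metadata. The class and function names are hypothetical, and `fetch_html` stands in for whatever call retrieves the memento content.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> str:
    """Real-time path: parse the title out of the HTML content."""
    parser = _TitleParser()
    parser.feed(html)
    return parser.title.strip()


class TitleIndex:
    """Metadata path: keep each extracted title so it is parsed only once."""

    def __init__(self):
        self._titles = {}

    def title_for(self, memento_uri, fetch_html):
        if memento_uri not in self._titles:
            self._titles[memento_uri] = extract_title(fetch_html(memento_uri))
        return self._titles[memento_uri]
```

Pre-building the whole index costs one pass over every memento; the lazy variant above pays that cost only for mementos someone actually views.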
Accessing Web Archive
• Wayback Machine API
• XML interface for the
list of available
Mementos
Accessing Web Archive
• Web Page Snapshot Replay
• URI rewriting,
JavaScript, and
embedded resources
Accessing Web Archive
• Page Completeness Degree
• The completeness
degree could be
calculated in real
time by using the
preserved HTTP
status of the
embedded resources
See also: http://arxiv.org/abs/1309.5503
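One simple way to turn this idea into a number is the share of embedded resources whose preserved HTTP status indicates success. This definition is an assumption made for illustration; the slide does not give the exact formula.

```python
def completeness_degree(embedded_statuses):
    """Fraction of embedded resources archived with a 2xx HTTP status.

    `embedded_statuses` is an iterable of preserved status codes, one per
    embedded resource (images, stylesheets, scripts). A page with no
    embedded resources is treated as complete.
    """
    statuses = list(embedded_statuses)
    if not statuses:
        return 1.0
    ok = sum(1 for status in statuses if 200 <= status < 300)
    return ok / len(statuses)
```

Because it only reads status codes already stored with the capture, the calculation needs no extra fetches at replay time.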
Accessing Web Archive
• Reconstructing web site
• Current approach is
using the web archive
public interface.
Accessing Web Archive
• Wayback Annotator
• Create collections
• Select and save
relevant content to
their collections
• Annotate & mark
important parts of
archived web pages
• Share their work and
collaborate on
archived content use
http://netpreserve.org/sites/default/files/resources/Predstavitev_07.pdf
http://netpreserve.org/sites/default/files/resources/Wayback_annotator_06.pdf
Accessing Web Archive
Collection-Based
• In addition to
browsing the
collection, you can
browse the URIs in
this collection
• Research questions:
Collection overview
Accessing Web Archive
• Collection visualization
• Term frequency
algorithms should be
normalized to take
memento density into
consideration
http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
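A minimal sketch of that normalization, assuming mementos are grouped by year and the term count is divided by the number of mementos in each year. The grouping unit and the whitespace tokenization are simplifications chosen for the example, not the method from the cited thesis.

```python
from collections import Counter, defaultdict


def normalized_term_frequency(mementos, term):
    """Mean occurrences of `term` per memento, for each year.

    `mementos` is an iterable of (year, text) pairs. Dividing by the number
    of mementos per year keeps heavily crawled years from dominating.
    """
    counts = defaultdict(int)
    density = Counter()
    needle = term.lower()
    for year, text in mementos:
        density[year] += 1
        counts[year] += text.lower().split().count(needle)
    return {year: counts[year] / density[year] for year in density}
```

Without the division, a year with four captures would appear to mention a term more often than a year with one capture even when each individual page mentions it less.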
Accessing Web Archive
• Web Archive analytics
See also: http://ilpubs.stanford.edu:8090/1037/1/arcspread.pdf
• ArcSpread took a
query from the
user, extracted related
information, and
displayed the results
in spreadsheet style.
Who And What Links To The
Internet Archive
Y. Alnoamany, A. AlSum, M. C. Weigle, M. L. Nelson
In Proceedings of 17th International Conference on
Theory and Practice of Digital Libraries, TPDL
2013, 2013 (Best Student Paper)
See also: http://arxiv.org/abs/1309.4016
Serving Robots!
• Log files analysis using Apache Pig
• Robots outnumber humans in accesses
to the IA Wayback Machine:
• 10:1 in terms of sessions,
• 5:4 in terms of raw HTTP accesses
• 4:1 in terms of megabytes transferred
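The study ran its log analysis with Apache Pig; the sketch below only illustrates the kind of aggregation involved, written in Python for brevity. The user-agent keyword heuristic and the function names are assumptions, and real robot detection would need more signals than a substring match.

```python
from fractions import Fraction

# Hypothetical heuristic: substrings that mark a user agent as a robot.
BOT_MARKERS = ("bot", "crawler", "spider")


def is_robot(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def robot_human_ratios(entries):
    """Return robot:human ratios for request count and bytes transferred.

    `entries` is an iterable of (user_agent, bytes_sent) pairs taken from
    an access log.
    """
    totals = {True: [0, 0], False: [0, 0]}  # is_robot -> [requests, bytes]
    for user_agent, size in entries:
        bucket = totals[is_robot(user_agent)]
        bucket[0] += 1
        bucket[1] += size
    if totals[False][0] == 0 or totals[False][1] == 0:
        raise ValueError("no human traffic to compare against")
    return {
        "requests": Fraction(totals[True][0], totals[False][0]),
        "bytes": Fraction(totals[True][1], totals[False][1]),
    }
```

Session-level ratios would additionally require grouping requests by client and time gap, which is omitted here.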
Access
[Figure: robot-to-human ratios for sessions (10:1), HTTP accesses (5:4), and MB transferred (4:1)]
Where do Wayback Machine Users
Come From?
Website                      Percentage   Description
en.wikipedia.org             12.9%        Wikipedia
archive.org                  11.9%        IA Home Page
reddit.com                   10.2%        Social News Web Site
google.TLD                   9.9%         Search Engine
info-poland.buffalo.edu      1.5%         Polish Studies
de.wikipedia.org             1.4%         Wikipedia
cracked.com                  1.2%         Humor Site
snopes.com                   1.1%         Urban Legends Reference Pages
facebook.com                 0.9%         Social Media
crochetpatterncentral.com    0.9%         Crocheting Hobbies
Access
Most Languages Self-Link
Access
ArcLink:
Optimization Techniques To Build And Retrieve
The Temporal Web Graph
A. AlSum, M. L. Nelson
IIPC GA 2013, Ljubljana, Slovenia
In Proceedings of the 13th international ACM/IEEE joint
conference on Digital libraries, JCDL '13, 2013
See also: http://arxiv.org/abs/1305.5959
Easy Solved Questions
Q: What are the available mementos for
vancouver2010.com?
Access
Solved Questions, but hard
Q: What are the HTML titles for vancouver2010.com
through time?
A: Page scraping for all mementos
Access
Impossible Questions
Q: What anchor texts pointed to
www.vancouver2010.com through time?
Access
…
<a href=www.vancouver2010.com >
Vancouver Olympics
</a>
….
…
<a href=www.vancouver2010.com >
Winter Olympics
</a>
…
…
<a href=www.vancouver2010.com >
Vancouver 2010
</a>
…
ArcLink
Access
Google code: https://code.google.com/p/arcsys/
Impossible Questions
• Q: What anchor texts pointed to
www.vancouver2010.com through time?
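To make the question concrete, the sketch below collects the anchor text of every link to a target from a set of dated pages. It is not ArcLink's implementation; the parser class, the function name, and the substring match on `href` are assumptions, and the hard part in practice is having the inbound-link graph for the whole archive rather than parsing a single page.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collects the text of <a> elements whose href contains `target`."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self._capturing = False
        self._buffer = []
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = (dict(attrs).get("href") or "").strip()
            self._capturing = self.target in href
            self._buffer = []

    def handle_data(self, data):
        if self._capturing:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._capturing:
            # Collapse internal whitespace and line breaks.
            self.anchors.append(" ".join("".join(self._buffer).split()))
            self._capturing = False


def anchor_text_timeline(mementos, target):
    """Return (datetime, anchor text) pairs in time order.

    `mementos` is an iterable of (datetime_label, html) pairs for pages
    that may link to `target`.
    """
    timeline = []
    for when, html in sorted(mementos):
        parser = AnchorTextParser(target)
        parser.feed(html)
        timeline.extend((when, text) for text in parser.anchors)
    return timeline
```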
Access
Thumbnail Summarization
Techniques For Web
Archives
A. AlSum, and M. L. Nelson
Submitted for publication.
Thumbnails
Access
Internet Archive UK Web archive
Thumbnail Creation Challenges
• Scalability in Time
• IA may need 361 years to create one thumbnail per memento
using one hundred machines
• Scalability in Space
• IA will need 355 TB to store one thumbnail per memento
• Page quality
Access
How many thumbnails do we need?
Access
www.unfi.com on the live Web
40 Thumbnails are good.
Access
Same technique applied to apple.com
Access
From 8000 Mementos to 69 Thumbnails.
Access
iTunes cover application
Access
Community
• I suggest becoming a member of IIPC
• Join the open Wayback Machine team
• Join the Winter Olympics 2014 collaborative project, even as an
observer
Community
• Web Archiving Workshops
WAC 2011, Ottawa, Canada
WAC 2012, Stanford, CA, USA
WADL 2013, Indianapolis, IN, USA
TempWeb 2013, Rio de Janeiro, Brazil
Tools for the SUL Web Archive
• Selection
• Harvest
• Analysis
• Access
Conclusions
• Be Selective: Cover missing parts of the Web
• Be Older: Include WebBase
• Be Smart: Innovative services
• Be Helpful: Researcher Framework/Dataset
• Be Active: Participate in the WA communities
• Make a difference
aalsum@cs.odu.edu
@aalsum
BACKUP
What is missing?
Thumbnail Features
SimHash DOM tree
Embedded resources Datetime
Clustering technique
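As an illustration of how a SimHash feature could reduce thousands of mementos to a few thumbnails, the sketch below fingerprints each memento's text and keeps a memento only when its fingerprint differs enough from the last one kept. The selection rule, the threshold value, and the use of MD5 as the token hash are assumptions for this example; the slides name SimHash as a feature but do not specify the procedure.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint over whitespace-separated tokens."""
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def select_for_thumbnails(mementos, threshold: int = 8):
    """Pick mementos that differ noticeably from the previously picked one.

    `mementos` is a list of (datetime_label, text) pairs in time order.
    """
    selected = []
    last = None
    for label, text in mementos:
        fingerprint = simhash(text)
        if last is None or hamming(fingerprint, last) > threshold:
            selected.append(label)
            last = fingerprint
    return selected
```

Similar pages produce fingerprints that differ in few bits, so a run of near-identical captures collapses to a single thumbnail.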
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 

Kürzlich hochgeladen (20)

CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineering
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 

WEB ARCHIVING CHALLENGES & OPPORTUNITIES PRESENTATION

  • 1. WEB ARCHIVING CHALLENGES & OPPORTUNITIES PRESENTATION FOR WEB ARCHIVING ENGINEERING POSITION Ahmed AlSum PhD Candidate Old Dominion University
  • 2. Outline • Engineering Experience • IBM • Old Dominion University • Internet Archive • Web Archiving Challenges & Opportunities • Selection • Harvesting • Storage • Access • Community • Conclusions
  • 4. CCSP Project • An internal IBM support portal that provides client-facing audiences a by-client, holistic view of client situations • Technologies: WebSphere Portal, DB2, deployed on zLinux machines
  • 5. Responsibilities • Software Engineer • Enterprise Applications with J2EE platform technologies for frontend (Servlets, JSP, Portlet APIs), and backend tasks based on EJB • Front-end components based on Web 2.0 technologies (AJAX based on dojo 1.0, and JavaScript) • Lotus Sametime (Plugins and Bot development) • Software engineer team leader • Support project quality activities • Lead code review and static analysis activities
  • 6. Responsibilities • Administrator • Deploying Portal solutions on WebSphere Portal • WebSphere Portal Administration for standalone and clustered environment • Administration on Linux and Windows OS • DB2 server administration for single instance and multiple instances with HADR support • Customer support team lead • Leading customer support activities
  • 8. Sharing IBM Internal Solutions with Broader Community
  • 10. Memento • Memento is an HTTP extension to integrate the Past and the Current Web I. Jacobs and N. Walsh, Architecture of the World Wide Web, Technical report, W3C, 2004 http://www.w3.org/TR/webarch/ Now T1 T2 T3
  • 11. Memento • Developer and administrator for Memento aggregator and proxies
  • 12. Memento Clients • Memento is currently an Internet-Draft (I-D); it is expected to move to RFC soon.
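The HTTP extension described in the Memento slides can be illustrated with datetime negotiation: a client asks for a past version of a resource by sending an `Accept-Datetime` request header carrying an HTTP-date in GMT. The following is a minimal sketch under that assumption; the helper name `accept_datetime_headers` is hypothetical, and no request is actually sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_headers(when: datetime) -> dict:
    """Return request headers asking for the version closest to `when`.

    `when` must be timezone-aware; it is converted to GMT because the
    header value is an HTTP-date.
    """
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": http_date}


# Ask for the state of a resource around the 2010 Winter Olympics.
headers = accept_datetime_headers(datetime(2010, 2, 1, tzinfo=timezone.utc))
print(headers["Accept-Datetime"])
```

A client would attach these headers to a request against an archive's negotiation endpoint; the endpoint URI depends on the archive and is not shown here.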
  • 13. San Francisco, CA USA 2012
  • 14. WAT Extraction • Web Archive Transformation (WAT) is a specification for structuring metadata generated by Web crawls • Technologies:
  • 16. Web Archive Life Cycle Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of the 3rd International Conference on Web Science, pp. 1–8
  • 17. Selection • Decide what to capture Everything, any domain National domains Delegate selection to partners Users’ favorites • We studied what is already captured
  • 18. How Much Of The Web Is Archived? S. G. Ainsworth, A. AlSum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries, JCDL '11, Ottawa, Canada 2011 See also: http://arxiv.org/abs/1212.6177
  • 19. Archive categories We have 3 categories of archives • Internet Archive (classic interface) • Search engine • Other archives Selection U K U S Public Archives, ca. Late 2010 / Early 2011
  • 20. 1000 URIs Ordered by First Observation Date Selection See also: http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html
  • 21. Memento Distribution, ordered by the first observation date
  • 22. How Much of the Web is Archived? It Depends on Which Web… Selection Including SE cache Excluding SE Cache 90% 79% 97% 68% 88% 19% 35% 16% Changes since 2011: no more free SE APIs; greatly reduced IA quarantine period; 15 public web archives 2013 95% 92% 23% 26%
  • 23. Profiling Web Archive Coverage For Top-level Domain And Content Language A. AlSum, M. C. Weigle, M. L. Nelson, and H. Van de Sompel In Proceedings of the 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013, 2013 See also: http://arxiv.org/abs/1309.4008
  • 24. Where is it archived? Selection IA Internet Archive CAN Library and Archives Canada PO Portuguese Web Archive CZ Archive of the Czech Web LoC Library of Congress BL British Library CAT Web Archive of Catalonia TW National Taiwan University IC Icelandic Web Archive UK UK National Library CR Croatian Web Archive AIT Archive It
  • 25. Language Coverage Selection IA Internet Archive CAN Library and Archives Canada PO Portuguese Web Archive CZ Archive of the Czech Web LoC Library of Congress BL British Library CAT Web Archive of Catalonia TW National Taiwan University IC Icelandic Web Archive UK UK National Library CR Croatian Web Archive AIT Archive It
  • 26. Growth Rate Selection IA Internet Archive CAN Library and Archives Canada PO Portuguese Web Archive CZ Archive of the Czech Web LoC Library of Congress BL British Library CAT Web Archive of Catalonia TW National Taiwan University IC Icelandic Web Archive UK UK National Library CR Croatian Web Archive AIT Archive It Borrowed Portuguese material from IA Stopped archiving since 2008 Steady growth Stopped getting new URIs, but still crawling
  • 27. Selection Research Output • Some portions of the web, such as India and Africa, are not well archived. • Profiling helps us with Memento query routing. • IIPC proposal with Herbert Van de Sompel (LANL) and David Rosenthal (SUL). Selection
  • 28. Selection at SUL • Focus on the missing parts of the Web • Twitter - Crowdsource: • UK Web archive: Twittervana • Internet Memory: Collect URIs from Twitter APIs • VA Tech: CTRNET project • Stanford Community • World News collection: 10 news websites from each country • Tools: Selection
  • 29. Web Archive Life Cycle Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of the 3rd International Conference on Web Science, pp. 1–8
  • 30. Harvesting • Services • Archive-It • WAS @ CDLib • Dedicated servers • New tools See also: http://ws-dl.blogspot.com/2013/07/2013-07-10-warcreate-and-wail-warc.html
  • 31. Special Harvesting Techniques • Borrow old materials from other web archives • Ex.: Stanford WebBase Project* • 260 TB • 7 billion webpages Harvesting *http://www-diglib.stanford.edu/~testbed/doc2/WebBase/
  • 32. Special Harvesting Techniques • Social Media • Focus on shared resources in the social media Harvesting Hany M SalahEldeen, Michael L Nelson, Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?, Proceedings of TPDL 2012 http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
  • 33. Special Harvesting Techniques • SiteStory - Transactional Archive Harvesting Justin F Brunelle, Michael L Nelson, Lyudmila Balakireva, Robert Sanderson, Herbert Van de Sompel, Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool, Proceedings of TPDL 2013 Sitestory: http://mementoweb.github.io/SiteStory/
  • 34. Harvesting • Challenges • Ajax and Web 2.0/3.0 • Streaming Media • URI challenges • Mobile Harvesting http://blog.dshr.org/2012/05/harvesting-and-preserving-future-web.html http://netpreserve.org/sites/default/files/resources/OverviewFutureWebWorkshop.pdf
  • 35. Web Archive Life Cycle Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of the 3rd International Conference on Web Science, pp. 1–8
  • 36. Storage (Format) • Flat files: • WARC files (ISO standard) • NoSQL db: • HBase at Internet Memory* • Storage at SUL: • We need to use both Storage *Philippe Rigaux, Understanding HBase: The data model, IM technology blog http://internetmemory.org/en/index.php/synapse/understanding_the_hbase_data_model/
  • 37. Storage (Infrastructure) • Wrong solution could be a disaster Storage
  • 38. Web Archive Life Cycle Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of the 3rd International Conference on Web Science, pp. 1–8
  • 39. Accessing Web Archive URI-Based WayBack Machine • Textbox to enter the requested URI • BubbleMap to show you the available mementos
  • 40. Accessing Web Archive Full-text search • Challenges: Temporal Page Rank, Rank per site or memento, Date filtering
  • 41. Accessing Web Archive • Thumbnail View • Trade-off between building the thumbnail in real time and pre-building it. Also, a trade-off between representing the thumbnail by URI and by embedded binary data. Can we build a partial thumbnail map?
  • 42. Accessing Web Archive • Title View • Trade-off between extracting all the titles in advance and keeping them as metadata about the memento, and extracting the title from the HTML content in real time. Implemented using Simile: http://www.simile-widgets.org/timeline/
  • 43. Accessing Web Archive • Wayback Machine API • XML interface for the list of available Mementos
  • 44. Accessing Web Archive • Web Page Snapshot Replay • URI rewriting, JavaScript, and embedded resources
  • 45. Accessing Web Archive • Page Completeness Degree • The completeness degree could be calculated in real time by using the preserved HTTP status for the embedded resources See also: http://arxiv.org/abs/1309.5503
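The idea on the Page Completeness Degree slide can be sketched as follows. This is only an illustration: it assumes the degree is the plain fraction of embedded resources whose preserved HTTP status is 2xx, which is a simplification chosen for this example and not necessarily the measure used in the referenced paper. The function name and the sample resources are hypothetical.

```python
def completeness_degree(embedded_statuses: dict) -> float:
    """Fraction of embedded resources whose archived HTTP status is 2xx.

    `embedded_statuses` maps each embedded resource URI to the HTTP
    status code preserved by the archive for that resource.
    """
    if not embedded_statuses:
        # Assumption: a page without embedded resources counts as complete.
        return 1.0
    ok = sum(1 for status in embedded_statuses.values() if 200 <= status < 300)
    return ok / len(embedded_statuses)


# Hypothetical memento with four embedded resources, two of them missing.
statuses = {"style.css": 200, "logo.png": 200, "app.js": 404, "banner.gif": 503}
print(completeness_degree(statuses))
```

Because only the stored status codes are read, the value can be computed at request time without fetching the embedded resources again.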
  • 46. Accessing Web Archive • Reconstructing web site • Current approach is using the web archive public interface.
  • 47. Accessing Web Archive • Wayback Annotator • Create collections • Select and save relevant content to their collections • Annotate & mark important parts of archived web pages • Share their work and collaborate on archived content use http://netpreserve.org/sites/default/files/resources/Predstavitev_07.pdf http://netpreserve.org/sites/default/files/resources/Wayback_annotator_06.pdf
  • 48. Accessing Web Archive Collection-Based • In addition to browsing the collection, you can browse the URIs in this collection • Research questions: Collection overview
  • 49. Accessing Web Archive • Collection visualization • Term frequency algorithms should be normalized to take the density of mementos into consideration http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
  • 50. Accessing Web Archive • Web Archive analytics See also: http://ilpubs.stanford.edu:8090/1037/1/arcspread.pdf • ArcSpread took a query from the user, extracted related information, and displayed the results in spreadsheet style.
  • 51. Who And What Links To The Internet Archive Y. Alnoamany, A. AlSum, M. C. Weigle, M. L. Nelson In Proceedings of 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013, 2013 (Best Student Paper) See also: http://arxiv.org/abs/1309.4016
  • 52. Serving Robots! • Log file analysis using Apache Pig • In accesses to the IA Wayback Machine, robots outnumber humans • 10:1 in terms of sessions • 5:4 in terms of raw HTTP accesses • 4:1 in terms of megabytes transferred Access
  • 53. Where do Wayback Machine Users Come From? Access
      Website | Percentage | Description
      en.wikipedia.org | 12.9% | Wikipedia
      archive.org | 11.9% | IA Home Page
      reddit.com | 10.2% | Social News Web Site
      google.TLD | 9.9% | Search Engine
      info-poland.buffalo.edu | 1.5% | Polish Studies
      de.wikipedia.org | 1.4% | Wikipedia
      cracked.com | 1.2% | Humor Site
      snopes.com | 1.1% | Urban Legends Reference Pages
      facebook.com | 0.9% | Social Media
      crochetpatterncentral.com | 0.9% | Crocheting Hobbies
  • 55. ArcLink: Optimization Techniques To Build And Retrieve The Temporal Web Graph A. AlSum, M. L. Nelson IIPC GA 2013, Ljubljana, Slovenia In Proceedings of the 13th international ACM/IEEE joint conference on Digital libraries, JCDL '13, 2013 See also: http://arxiv.org/abs/1305.5959
  • 56. Easy Solved Questions Q: What are the available mementos for vancouver2010.com? Access
  • 57. Solved Questions, but hard Q: What are the HTML titles for vancouver2010.com through time? A: Page scraping for all mementos Access
  • 58. Impossible Questions Q: What are the anchor texts that pointed to www.vancouver2010.com through time? Access … <a href=www.vancouver2010.com > Vancouver Olympics </a> …. … <a href=www.vancouver2010.com > Winter Olympics </a> … … <a href=www.vancouver2010.com > Vancouver 2010 </a> …
  • 60. Impossible Questions • Q: What are the anchor texts that pointed to www.vancouver2010.com through time? Access
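The anchor-text question above requires reading the inlinks of a page from every archived page that points to it. As a small sketch of the per-page step only, the following collects the anchor text of links to one target from a piece of HTML using Python's standard `html.parser`. The class and function names are hypothetical, matching is by exact `href` string, and finding which archived pages to scan (the hard part the slides point at) is not addressed.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect the text of <a> elements whose href equals `target`."""

    def __init__(self, target: str):
        super().__init__()
        self.target = target
        self.texts = []
        self._buffer = None  # text chunks while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") == self.target:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            # Collapse internal whitespace and trim the ends.
            self.texts.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


def anchor_texts(html: str, target: str) -> list:
    collector = AnchorTextCollector(target)
    collector.feed(html)
    return collector.texts


# Hypothetical linking pages keyed by capture timestamp.
snapshots = {
    "20090101000000": '<a href="www.vancouver2010.com"> Vancouver Olympics </a>',
    "20100201000000": '<a href="www.vancouver2010.com"> Winter Olympics </a>',
}
for timestamp, html in sorted(snapshots.items()):
    print(timestamp, anchor_texts(html, "www.vancouver2010.com"))
```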
  • 61. Thumbnail Summarization Techniques For Web Archives A. AlSum, and M. L. Nelson Submitted for publication.
  • 63. Thumbnail Creation Challenges • Scalability in Time • IA may need 361 years to create one thumbnail per memento using one hundred machines • Scalability in Space • IA will need 355 TB to store one thumbnail per memento • Page quality Access
  • 64. How many thumbnails do we need? Access www.unfi.com on the live Web
  • 65. How many thumbnails do we need? Access www.unfi.com on the live Web
  • 66. 40 Thumbnails are good. Access
  • 67. Same technique applied to apple.com Access
  • 68. From 8000 Mementos to 69 Thumbnails. Access
  • 70. Community • I suggest becoming a member of IIPC • Join the open Wayback Machine team • Join the Winter Olympics 2014 collaborative project, even as an observer
  • 71. Community • Web Archiving Workshops WAC 2011, Ottawa, Canada WAC 2012, Stanford, CA, USA WADL 2013, Indianapolis, IN, USA TempWeb 2013, Rio de Janeiro, Brazil
  • 72. Tools to SUL Web Archive • Selection • Harvest • Analysis • Access
  • 73. Conclusions • Be Selective: Cover missing parts of the Web • Be Older: Include WebBase • Be Smart: Innovative services • Be Helpful: Researcher Framework/Dataset • Be Active: Participate in the WA communities • Make a difference aalsum@cs.odu.edu @aalsum
  • 75. What is missing? IA Internet Archive CAN Library and Archives Canada PO Portuguese Web Archive CZ Archive of the Czech Web LoC Library of Congress BL British Library CAT Web Archive of Catalonia TW National Taiwan University IC Icelandic Web Archive UK UK National Library CR Croatian Web Archive AIT Archive It
  • 76. Thumbnail Features SimHash DOM tree Embedded resources Datetime
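To make the SimHash feature on the last slide concrete, here is a minimal sketch of how near-duplicate mementos could be skipped so that fewer thumbnails are needed. It is not the authors' algorithm: the whitespace tokenization, the use of MD5 as the token hash, the 64-bit fingerprint size, the threshold of 8 bits, and the function names are all assumptions made for illustration, and the other listed features (DOM tree, embedded resources, datetime) are ignored.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint over whitespace-separated tokens."""
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is set when the tokens vote positive on it.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def select_thumbnails(mementos, threshold: int = 8):
    """Keep a memento only if it differs enough from the last kept one.

    `mementos` is an iterable of (datetime_key, html) pairs in time order.
    """
    kept, last = [], None
    for datetime_key, html in mementos:
        fingerprint = simhash(html)
        if last is None or hamming(fingerprint, last) > threshold:
            kept.append(datetime_key)
            last = fingerprint
    return kept
```

Comparing each memento only with the last kept one keeps the pass linear in the number of mementos; a different comparison strategy would change which thumbnails survive.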

Editor's Notes

  1. The notable exceptions of Japanese  {Bengali, Vietnamese} and German Portuguese