Archival Rates of Web URIs

•Download as PPTX, PDF•

3 likes•3,537 views

Ahmed AlSum

This is the presentation that given on JCDL 2011, Ottawa, Canada on "How Much of the Web is Archived?"

Education

How Much of the Web is Archived?* Scott Ainsworth, Ahmed AlSum, Hany SalahEldeen, Michele C. Weigle, Michael L. Nelson Old Dominion University, USA {sainswor, aalsum, hany, mweigle, mln}@cs.odu.edu JCDL 2011, Ottawa, Canada *This work supported in part by the Library of Congress and NSF IIS-1009392.

How Many Archive Copies? http://www.cnn.com +8000 copies http://cs.odu.edu/~aalsum

Research Question How Much of the Web is Archived?

4 Sample sets – 1000 URIs each Experiment Sampling Techniques 5

Sampling Techniques DMOZ ,[object Object],100 snapshots made from July 20, 2000 through October 3, 2010 ,[object Object]

Sampling Techniques Delicious ,[object Object],7

Sampling Techniques Bitly ,[object Object]

Random hash values were created and dereferenced until 1,000 target URIs were discovered8

Sampling Techniques Search Engine ,[object Object]

The phrase pool was selected from the 5-grams in Google’s N-gram data.

A random sample of these queries was used to obtain URIs, of which 1,000 were selected at random.9

For each URI, Memento Aggregator responded with TimeMap for this URI.Example<http://memento.waybackmachine.org/memento/20010819194233/http://jcdl2002.org>;rel="first memento";datetime="Sun, 19 Aug 2001 19:42:33 GMT“, <http://memento.waybackmachine.org/memento/20011216220248/http://jcdl2002.org>; rel="memento"; datetime="Sun, 16 Dec 2001 22:02:48 GMT",

We have 3 categories of archives Internet Archive (classic interface) Search engine Other archives Archive categories UK US 12

• Internet Archive •Search Engine • Others Archives Memento Distribution, ordered by the first observation date.

Memento Distribution, ordered by the first observation date.

(14.6% archived >= once per month) How often are URIs archived?

Similar to Archival Rates of Web URIs

The Off-Topic Memento ToolkitShawn Jones

Summarizing archival collections using storytelling techniquesMichael Nelson

More Archives, More Better Michael Nelson

Who Will Archive the Archives? Thoughts About the Future of Web ArchivingMichael Nelson

Nelson, Michael: Summarizing Archival Collections Using Storytelling TechniquesReynolds Journalism Institute (RJI)

Toward Semantic Sensor Data Archives on the WebJean-Paul Calbimonte

The Many Shapes of Archive-ItShawn Jones

10-31-13 “Researcher Perspectives of Data Curation” Presentation SlidesDuraSpace

ArcLink - IIPC GA 2013Ahmed AlSum

Web Archiving Profile - WADL 2013Ahmed AlSum

Web 2.0 for Archivists, Powerpoint VersionArian Ravanbakhsh

Improving Understanding of Web Archive Collections Through Storytelling - PhD...Shawn Jones

Web archiving challenges and opportunitiesAhmed AlSum

Detecting Off-Topic Web Pages at #CUWARCMichele Weigle

Managing Software Selection and Acquisition: From Problem to Solutionsuyu22

The web is a mess: how I learnt to stop worrying and love web archiving. Kris...Biblioteca Nacional de España

Why We Need Multiple ArchivesMichael Nelson

Richard Rogers' slides from Digital Conversations event on 26/09/2013Digital Research and Curator Team @ British Library

Improving Collection Understanding in Web ArchivesShawn Jones

Blockchain Can Not Be Used To Verify Replayed Archived Web PagesMichael Nelson

Similar to Archival Rates of Web URIs (20)

The Off-Topic Memento Toolkit

Summarizing archival collections using storytelling techniques

More Archives, More Better

Who Will Archive the Archives? Thoughts About the Future of Web Archiving

Nelson, Michael: Summarizing Archival Collections Using Storytelling Techniques

Toward Semantic Sensor Data Archives on the Web

The Many Shapes of Archive-It

10-31-13 “Researcher Perspectives of Data Curation” Presentation Slides

ArcLink - IIPC GA 2013

Web Archiving Profile - WADL 2013

Web 2.0 for Archivists, Powerpoint Version

Improving Understanding of Web Archive Collections Through Storytelling - PhD...

Web archiving challenges and opportunities

Detecting Off-Topic Web Pages at #CUWARC

Managing Software Selection and Acquisition: From Problem to Solution

The web is a mess: how I learnt to stop worrying and love web archiving. Kris...

Why We Need Multiple Archives

Richard Rogers' slides from Digital Conversations event on 26/09/2013

Improving Collection Understanding in Web Archives

Blockchain Can Not Be Used To Verify Replayed Archived Web Pages

Recently uploaded

Q4 English4 Week3 PPT Melcnmg-based.pptxnelietumpap1

ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1

Computed Fields and api Depends in the Odoo 17Celine George

Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99

ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood

YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxConquiztadors- the Quiz Society of Sri Venkateswara College

YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxConquiztadors- the Quiz Society of Sri Venkateswara College

Difference Between Search & Browse Methods in Odoo 17Celine George

Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George

ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1

TataKelola dan KamSiber Kecerdasan Buatan v022.pdfSarwono Sutikno, Dr.Eng.,CISA,CISSP,CISM,CSX-F

DATA STRUCTURE AND ALGORITHM for beginnersSabitha Banu

MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma

What is Model Inheritance in Odoo 17 ERPCeline George

4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239

Proudly South Africa powerpoint Thorisha.pptxthorishapillay1

OS-operating systems- ch04 (Threads) ...Dr. Mazin Mohamed alkathiri

Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir

Full Stack Web Development Course for BeginnersSabitha Banu

Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup

Recently uploaded (20)

Q4 English4 Week3 PPT Melcnmg-based.pptx

ENGLISH6-Q4-W3.pptxqurter our high choom

Computed Fields and api Depends in the Odoo 17

Choosing the Right CBSE School A Comprehensive Guide for Parents

ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT

YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx

YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx

Difference Between Search & Browse Methods in Odoo 17

Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17

ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...

TataKelola dan KamSiber Kecerdasan Buatan v022.pdf

DATA STRUCTURE AND ALGORITHM for beginners

MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx

What is Model Inheritance in Odoo 17 ERP

4.18.24 Movement Legacies, Reflection, and Review.pptx

Proudly South Africa powerpoint Thorisha.pptx

OS-operating systems- ch04 (Threads) ...

Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf

Full Stack Web Development Course for Beginners

Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf

Archival Rates of Web URIs

1. How Much of the Web is Archived?* Scott Ainsworth, Ahmed AlSum, Hany SalahEldeen, Michele C. Weigle, Michael L. Nelson Old Dominion University, USA {sainswor, aalsum, hany, mweigle, mln}@cs.odu.edu JCDL 2011, Ottawa, Canada *This work supported in part by the Library of Congress and NSF IIS-1009392.

2. How Much of the Web is Archived?* Scott Ainsworth, Ahmed AlSum, Hany SalahEldeen, Michele C. Weigle, Michael L. Nelson Old Dominion University, USA {sainswor, aalsum, hany, mweigle, mln}@cs.odu.edu JCDL 2011, Ottawa, Canada *This work supported in part by the Library of Congress and NSF IIS-1009392.

3. How Many Archive Copies? http://www.cnn.com +8000 copies http://cs.odu.edu/~aalsum

4. Research Question How Much of the Web is Archived?

5. 4 Sample sets – 1000 URIs each Experiment Sampling Techniques 5

7. Randomly selected 1000 URIs6

10. Random hash values were created and dereferenced until 1,000 target URIs were discovered8

11.

12. The phrase pool was selected from the 5-grams in Google’s N-gram data.

13. A random sample of these queries was used to obtain URIs, of which 1,000 were selected at random.9

14.

15. For each URI, Memento Aggregator responded with TimeMap for this URI.Example<http://memento.waybackmachine.org/memento/20010819194233/http://jcdl2002.org>;rel="first memento";datetime="Sun, 19 Aug 2001 19:42:33 GMT“, <http://memento.waybackmachine.org/memento/20011216220248/http://jcdl2002.org>; rel="memento"; datetime="Sun, 16 Dec 2001 22:02:48 GMT",

16. Results 65% 63% 11

17. We have 3 categories of archives Internet Archive (classic interface) Search engine Other archives Archive categories UK US 12

18. • Internet Archive •Search Engine • Others Archives Memento Distribution, ordered by the first observation date.

19. Memento Distribution, ordered by the first observation date.

20. (14.6% archived >= once per month) How often are URIs archived?

21. (14.7% archived >= once per month) (31% archived >= once per month) (28.7% archived >= once per month) (14.6% archived >= once per month) How often are URIs archived?

22. What affects the archival rate? URI source (More popular – More archival) BackLinks (Weak positive relationship) Archive Coverage Internet Archive (Best) Search Engine (Good), but only one copy. The other archives cover only a small fraction with recent copies. Analysis 17

23. Tell me what is your URI source!! Conclusion How much of the Web is Archived? 18 Contact: Ahmed AlSum (aalsum@cs.odu.edu)

24. Ziv Bar-Yossef and Maxim Gurevich. Random sampling from a search engine’s index. J. ACM, 55(5), 2008. Frank McCown and Michael L. Nelson. Characterization of search engine caches. In Proceedings of IS&T Archiving 2007, May 2007. Herbert Van de Sompel, Michael Nelson, and Robert Sanderson. HTTP framework for time-based access to resource states — Memento, November 2010. http://datatracker.ietf.org/doc/draft-vandesompelmemento/. References 19

25. Thank You Contact: Ahmed AlSum aalsum@cs.odu.edu

Archival Rates of Web URIs

Recommended

Recommended

More Related Content

Similar to Archival Rates of Web URIs

Similar to Archival Rates of Web URIs (20)

Recently uploaded

Recently uploaded (20)

Archival Rates of Web URIs