Profiling Web Archives
Michael L. Nelson
Ahmed AlSum, Michele C. Weigle
Herbert Van de Sompel, David Rosenthal
IIPC General Assembly
Paris, France, May 21, 2014
Where's that issue
with the Afghan girl?
Prior IIPC Memento Aggregator Project
• Ten IIPC archives, led by LANL
• Conceived at 2011 IIPC meeting
• Results reported at 2012 IIPC meeting
o http://netpreserve.org/sites/default/files/resources/Sanderson.pdf
• Two highlights:
Stop and Rethink…
• LANL's processing was informative from a
"big data" perspective, but was neither
scalable nor sustainable
o "send us your CDX" == hard for both parties
o there are lots of URIs in the world
• Will only get worse with:
o more archives…
o …doing more archiving
Leverage Memento Aggregators
• Memento aggregators currently broadcast
URI lookups to all known archives
• New approach:
1. Build profiles based on sampling from URI lookups
(optionally supplement with CDX files when available)
2. Use archive profiles to inform Memento
aggregator "query routing" decisions
3. Share serialized profiles with other IIPC partners
http://mementoproxy.lanl.gov/aggr/timemap/link/1/http://www.bnf.fr/
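Step 1 of the new approach could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the sample data, and the definition of a TLD share (fraction of sampled URIs with at least one Memento) are all assumptions.

```python
from collections import Counter
from urllib.parse import urlsplit


def tld_of(uri):
    """Return the last DNS label of the URI's host, lowercased."""
    host = urlsplit(uri).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def build_tld_profile(lookups):
    """Build a TLD distribution for one archive from sampled URI lookups.

    lookups: iterable of (sample_uri, memento_count) pairs.
    Returns {tld: share of the sampled URIs that the archive holds}.
    """
    held = Counter(tld_of(uri) for uri, count in lookups if count > 0)
    total = sum(held.values())
    return {tld: n / total for tld, n in held.items()} if total else {}


# Hypothetical lookup results for one archive (not data from the study).
sample = [
    ("http://www.bnf.fr/", 12),
    ("http://gallica.bnf.fr/", 3),
    ("http://www.japantimes.co.jp/", 0),
    ("http://example.org/", 1),
]
profile = build_tld_profile(sample)
```

Here `profile` gives "fr" two thirds of the weight and "org" one third; the URI with no Mementos contributes nothing.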
Profiling Studies
• TPDL 2013
o 12 archives, March 2013, public web archives used
but techniques apply generally
o sampling only, no CDX access
• IJDL 2014 (to appear)
o 15 archives (+4, -1), October 2013
o slightly larger sample URI dataset
o results similar
URI Lookup = Limited Information
GET /aggr/timegate/http://www.bnf.fr/ HTTP/1.1
Host: mementoproxy.lanl.gov
Accept-Datetime: Sun, 29 May 2005 02:46:53 GMT
Accept-Language: fr; q=1.0, en; q=0.5
…
1. Original URI
2. Memento-Datetime
3. Preferred URI
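The request above can be assembled with the Python standard library. The sketch below only builds the URL and headers and sends nothing; the function name `timegate_request` is made up for illustration, and the TimeGate base URL is the one shown on the slide.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# TimeGate base URL as shown on the slide.
TIMEGATE_BASE = "http://mementoproxy.lanl.gov/aggr/timegate/"


def timegate_request(original_uri, when, languages=None):
    """Return (url, headers) for a TimeGate lookup; nothing is sent.

    when must be a timezone-aware UTC datetime so that the header is
    rendered as an HTTP date ending in "GMT".
    """
    headers = {"Accept-Datetime": format_datetime(when, usegmt=True)}
    if languages:
        headers["Accept-Language"] = languages
    return TIMEGATE_BASE + original_uri, headers


url, headers = timegate_request(
    "http://www.bnf.fr/",
    datetime(2005, 5, 29, 2, 46, 53, tzinfo=timezone.utc),
    languages="fr; q=1.0, en; q=0.5",
)
```

The resulting `url` and `headers` reproduce the request line and the two Accept headers shown on the slide.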
Where to find Mementos for …
http://www.japantimes.co.jp/
Where to find Mementos for …
http://www.bnf.fr
Research Question
Problem
• Profile public web archives according to the following
dimensions:
o Top-level domains
o Languages
o Growth rate
o Archival date
Motivation
• Determine who is archiving what
• Optimize query routing for a Memento Aggregator
Web Archives in this Experiment
Archive                        Full text   URI-lookup
Internet Archive                           √
Library of Congress                        √
Icelandic Web Archive                      √
Library and Archives Canada    √           √
British Library                √           √
UK National Library            √           √
Portuguese Web Archive         √           √
Web Archive of Catalonia       √           √
Croatian Web Archive           √           √
Archive of the Czech Web       √           √
National Taiwan University     √           √
Archive-It                     √           √
Experiment Set Up
• Sample URIs from seven different sources
• Retrieve the TimeMap for each URI from all archives
o A TimeMap lists all Mementos for a given URI
o A Memento is an archived version of a resource
• Analyze who has holdings for which URIs
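The "who has holdings for which URIs" step can be sketched as counting Mementos in each archive's TimeMap. The parser below is a simplified illustration for link-format TimeMaps with quoted parameters; the sample TimeMap and the `archive.example` host are made up, and a production parser would need to handle more of the link-format syntax.

```python
import re

# One link-value: <uri> followed by zero or more ;name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def mementos(timemap_text):
    """Yield (uri, datetime_string) for each link whose rel includes 'memento'."""
    for uri, params in LINK_RE.findall(timemap_text):
        attrs = dict(PARAM_RE.findall(params))
        if "memento" in attrs.get("rel", "").split():
            yield uri, attrs.get("datetime")


def holdings(timemaps_by_archive):
    """Map archive name -> number of Mementos listed in its TimeMap text."""
    return {
        name: sum(1 for _ in mementos(text))
        for name, text in timemaps_by_archive.items()
    }


# Hypothetical TimeMap for illustration only.
SAMPLE_TIMEMAP = """
<http://www.bnf.fr/>; rel="original",
<http://archive.example/timegate/http://www.bnf.fr/>; rel="timegate",
<http://archive.example/20050529024653/http://www.bnf.fr/>; rel="first memento"; datetime="Sun, 29 May 2005 02:46:53 GMT",
<http://archive.example/20130301000000/http://www.bnf.fr/>; rel="last memento"; datetime="Fri, 01 Mar 2013 00:00:00 GMT"
"""

counts = holdings({"archive-a": SAMPLE_TIMEMAP, "archive-b": ""})
```

An archive with an empty TimeMap counts as having no holdings for that URI.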
Sampling URIs - DMOZ
1. DMOZ:Random
o 10,000 URIs randomly sampled from the DMOZ directory (~5M URIs).
2. DMOZ:TLD - 2% of the URIs for each TLD in DMOZ, or 100 URIs,
whichever is greater
o 52 TLDs: (com 23,470), (de 6,332), (org 4,025), (uk 3,309), (net
2,073), (it 1,775), (jp 1,379), (ru 1,244), (fr 1,154), (pl 1,062), (au
764), (ca 642), (at 438), (edu 390), (cz 385), (tr 334), (info 319),
(cn 278), (us 266), (nz 265), (es 238), (ar 213), (no 150), (br 149),
(tw 141), (za 118), (fi 113), (100 URIs each for [ae, cat, cl, cu, eg, gov,
id, in, ir, is, ke, kr, ma, mt, mx, my, na, pe, pk, pt, sa, to, uy,
zw])
3. DMOZ:Languages - 100 URIs for each language
o 24 languages: Icelandic, Portuguese, Catalan, Afrikaans,
Arabic, Indonesian, Chinese (Simplified), Chinese (Traditional),
Dutch, Spanish, French, Greek, Hindi, Italian, Japanese,
Korean, Norwegian, Persian, Polish, Russian, Turkish,
Ukrainian
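The DMOZ:TLD rule ("2% or 100 URIs, whichever is greater") can be written as a small function. This is a sketch: the rounding behavior and the cap for TLDs with fewer than 100 URIs are assumptions, since the slide does not specify them, and the function names are illustrative.

```python
import random


def tld_sample_size(population, rate=0.02, floor=100):
    """2% of a TLD's URIs or 100, whichever is greater.

    Capped at the population size (an assumption for very small TLDs).
    """
    return min(population, max(floor, round(population * rate)))


def sample_tld(uris, rng, rate=0.02, floor=100):
    """Draw the per-TLD sample without replacement from a list of URIs."""
    return rng.sample(uris, tld_sample_size(len(uris), rate, floor))


# Hypothetical TLD with 7,500 URIs: 2% is 150, which exceeds the floor.
rng = random.Random(0)
picked = sample_tld([f"http://example.test/{i}" for i in range(7500)], rng)
```

A TLD with 3,000 URIs would fall back to the 100-URI floor, because 2% of 3,000 is only 60.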
Sampling URIs – Web Archives Full Text
• Query the full-text search interface of select web archives
with two sets of query terms.
4. Top 1-grams from Bing
o Most are English
5. Top 1000 query terms from Yahoo in 9 languages
o Excluding general keywords such as: Obama, Facebook.
Sampling URIs – User Requests
• Sampling from user requests for archived web resources
6. Sample from IA Wayback Machine log files
o 1,000 URIs randomly sampled from Feb 22, 2012 to Feb 26,
2012.
7. Sample from Memento Aggregator log files
o 100 URIs randomly sampled from LANL Memento Aggregator
logs between 2011 and 2013.
Archive Coverage per Sample
[Figure: entire sample; callouts at 100% and 35%]
TLD Coverage across Archives (1)
[Figure: entire sample]
TLD Coverage across Archives (2)
[Figure: entire sample]
TLD Distribution per Archive
[Figure: DMOZ:TLD sample]
TLD Distribution per Archive
[Figure: Web Archives Full Text sample]
Language Coverage per Archive
[Figure: DMOZ sample]
Archive Growth Rate
[Figure: entire sample]
Query Routing Evaluation
[Figure]
Study Results
• Introduced sampling to profile web archives using
available infrastructure, no privileged access
• Coverage:
o Internet Archive provides broad coverage
o National archives have good coverage for their domains
o Surprising coverage by certain archives
• Query Routing:
o In 84% of the cases, all existing Mementos for a TLD can be
found by using IA and two additional top archives for a TLD
o In 55% of the cases, all existing Mementos for a TLD can be
found by using the top 3 archives for a TLD, excluding IA
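The routing idea behind these numbers (query only a few top archives per TLD rather than broadcasting) might look like the sketch below. Ranking archives by a per-TLD score taken from their profiles is an assumption here, as are the function name, the archive codes, and the scores; the study's own ranking method is not given on the slide.

```python
def top_archives(profiles, tld, k=3, always=()):
    """Pick up to k archives to query for a TLD.

    profiles: {archive_name: {tld: score}}.
    always: archives to include first regardless of score (e.g. IA).
    Archives with no score for the TLD are never added.
    """
    chosen = [name for name in always if name in profiles]
    ranked = sorted(
        (name for name in profiles if name not in chosen),
        key=lambda name: profiles[name].get(tld, 0.0),
        reverse=True,
    )
    for name in ranked:
        if len(chosen) >= k:
            break
        if profiles[name].get(tld, 0.0) > 0:
            chosen.append(name)
    return chosen


# Hypothetical profiles; scores are illustrative, not study results.
profiles = {
    "IA": {"com": 0.5, "fr": 0.05, "tw": 0.01},
    "TW": {"tw": 0.6, "cn": 0.08},
    "PT": {"pt": 0.7, "com": 0.1},
    "CZ": {"cz": 0.8},
}
route_tw = top_archives(profiles, "tw", k=3, always=("IA",))
route_com = top_archives(profiles, "com", k=3)
```

With these made-up scores a lookup for a .tw URI goes to IA and TW only, so two archives are queried instead of four.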
Next Steps With the IIPC
• Finding the right granularity
o too fine:
http://www.bnf.fr/fr/evenements_et_culture/a.passe_bnf.html
o too coarse: .fr
o just right?: bnf.fr, www.bnf.fr, gallica.bnf.fr, www.bnf.fr/fr/
• Generating profiles
o what are desirable / representative sample sets: domains,
languages, regions, etc. -- what's missing?
o local CDX analysis tools (can help with cold start problem)
• Profile format
o community input (yet another metadata format)
o github (or other tools) for exchange & integration
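The granularity question can be made concrete by deriving candidate profile keys from a URI, from coarsest to finest. This is an illustrative sketch: the function name is made up, and taking the last two DNS labels as the registered domain is a naive assumption that fails for suffixes such as co.jp (a real tool would likely consult the Public Suffix List).

```python
from urllib.parse import urlsplit


def profile_keys(uri):
    """Return candidate profile keys for a URI, coarsest first."""
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    labels = host.split(".")
    keys = [labels[-1]]                      # TLD, e.g. "fr"
    if len(labels) >= 2:
        keys.append(".".join(labels[-2:]))   # naive registered domain
    if host not in keys:
        keys.append(host)                    # full host name
    segments = [s for s in parts.path.split("/") if s]
    # Add the first path segment only when it is clearly a directory.
    if len(segments) > 1 or (segments and parts.path.endswith("/")):
        keys.append(f"{host}/{segments[0]}/")
    return keys


keys = profile_keys(
    "http://www.bnf.fr/fr/evenements_et_culture/a.passe_bnf.html"
)
```

For the "too fine" URI on the slide this yields fr, bnf.fr, www.bnf.fr and www.bnf.fr/fr/, which are the coarse and "just right?" candidates listed above.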
A Possible Serialization
{"Profile":{
  "Name":"Taiwan Web Archive",
  "URI":"http://webarchive.lib.ntu.edu.tw",
  "TimeGate":"http://mementoproxy.cs.odu.edu/tw/timegate/",
  "Code":"TW",
  "Age":"Tue, 15 Jul 1997 00:00:00 GMT",
  "TLD":[{"tw":0.6},{"cn":0.08},{"hk":0.04},
         {"eg":0.04},{"gov":0.04},{"my":0.04},
         {"jp":0.04},{"kr":0.02}],
  "Language":[{"zh-TW":0.5},{"zh-CN":0.25},
              {"id":0.08},{"ar":0.08}],
  "GrowthRate":[
    {"199707":[4,4]},{"200202":[1,1]},
    {"200607":[30,62]},{"200608":[20,80]},
    {"200609":[5,9]},{"200612":[77,129]},
    ... // other values truncated
    {"201308":[7,94]},{"201309":[2,94]}]
  }
}
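A consumer of such a profile would probably flatten the lists of single-key objects into plain dictionaries. The sketch below uses a trimmed copy of the slide's example (the truncated GrowthRate array is left out, since the elision marker is not valid JSON); the helper name `flatten` is an assumption.

```python
import json

# Trimmed copy of the slide's example profile; GrowthRate is omitted.
PROFILE_JSON = """
{"Profile": {
  "Name": "Taiwan Web Archive",
  "Code": "TW",
  "TLD": [{"tw": 0.6}, {"cn": 0.08}, {"hk": 0.04}],
  "Language": [{"zh-TW": 0.5}, {"zh-CN": 0.25}]
}}
"""


def flatten(pairs):
    """Merge a list of single-key objects into one dictionary."""
    merged = {}
    for item in pairs:
        merged.update(item)
    return merged


profile = json.loads(PROFILE_JSON)["Profile"]
tld_share = flatten(profile["TLD"])
language_share = flatten(profile["Language"])
dominant_tld = max(tld_share, key=tld_share.get)
```

After flattening, a lookup such as `tld_share.get("tw", 0.0)` is a single dictionary access, which is convenient for routing decisions.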
{Light, Dim, Dark} Archives
• Work to date has assumed light archives
because our focus has been on sampling
archives we don't control
• Applicable to a continuum of archives:
o download/fork and run "dark-sample.py"
o it accesses sample URIs from IIPC github
o issues URI lookups to local archive
o write/update your archive profile in IIPC github with machine
readable IP restrictions
o all profiles -- light/dim/dark -- now available to Memento
aggregators and other IIPC analysis tools
Profiles = Easy Discovery, Sharing
http://netpreserve.org/aggr/timemap/link/1/http://www.bnf.fr/

Weitere ähnliche Inhalte

Was ist angesagt?

iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...Justin Brunelle
 
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred RepresentationsScripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred RepresentationsJustin Brunelle
 
Web@rchive Austria (Archiving Online Media)
Web@rchive Austria (Archiving Online Media)Web@rchive Austria (Archiving Online Media)
Web@rchive Austria (Archiving Online Media)Web@rchive Austria
 
Viaf and isni ifla 2013 08-16
Viaf and isni  ifla 2013 08-16Viaf and isni  ifla 2013 08-16
Viaf and isni ifla 2013 08-16Janifer Gatenby
 
Interoperability for web based scholarship
Interoperability for web based scholarshipInteroperability for web based scholarship
Interoperability for web based scholarshipHerbert Van de Sompel
 
Summarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniquesSummarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniquesMichael Nelson
 
FAIR Signposting: A KISS Approach to a Burning Issue
FAIR Signposting: A KISS Approach to a Burning IssueFAIR Signposting: A KISS Approach to a Burning Issue
FAIR Signposting: A KISS Approach to a Burning IssueHerbert Van de Sompel
 
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesInterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesSawood Alam
 
A Perspective on Archiving the Scholarly Record
A Perspective on Archiving the Scholarly RecordA Perspective on Archiving the Scholarly Record
A Perspective on Archiving the Scholarly RecordHerbert Van de Sompel
 
Intro to Linked Open Data in Libraries Archives & Museums.
Intro to Linked Open Data in Libraries Archives & Museums.Intro to Linked Open Data in Libraries Archives & Museums.
Intro to Linked Open Data in Libraries Archives & Museums.Jon Voss
 
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive ProfilingMementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive ProfilingSawood Alam
 
Signposting for Repositories
Signposting for RepositoriesSignposting for Repositories
Signposting for RepositoriesMartin Klein
 
OAC Presentation at CNI 09 Fall Forum
OAC Presentation at CNI 09 Fall ForumOAC Presentation at CNI 09 Fall Forum
OAC Presentation at CNI 09 Fall ForumRobert Sanderson
 
How to read a million books?
How to read a million books?How to read a million books?
How to read a million books?cneudecker
 
To the Rescue of the Orphans of Scholarly Communication
To the Rescue of the Orphans of Scholarly CommunicationTo the Rescue of the Orphans of Scholarly Communication
To the Rescue of the Orphans of Scholarly CommunicationMartin Klein
 
What is #LODLAM?! Understanding linked open data in libraries, archives [and ...
What is #LODLAM?! Understanding linked open data in libraries, archives [and ...What is #LODLAM?! Understanding linked open data in libraries, archives [and ...
What is #LODLAM?! Understanding linked open data in libraries, archives [and ...Alison Hitchens
 
What’s in a URL? Analysing COVID-19 web archive collections
What’s in a URL? Analysing COVID-19 web archive collectionsWhat’s in a URL? Analysing COVID-19 web archive collections
What’s in a URL? Analysing COVID-19 web archive collectionsWARCnet
 

Was ist angesagt? (20)

Creating Pockets of Persistence
Creating Pockets of PersistenceCreating Pockets of Persistence
Creating Pockets of Persistence
 
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
 
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred RepresentationsScripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
 
Web@rchive Austria (Archiving Online Media)
Web@rchive Austria (Archiving Online Media)Web@rchive Austria (Archiving Online Media)
Web@rchive Austria (Archiving Online Media)
 
Viaf and isni ifla 2013 08-16
Viaf and isni  ifla 2013 08-16Viaf and isni  ifla 2013 08-16
Viaf and isni ifla 2013 08-16
 
Interoperability for web based scholarship
Interoperability for web based scholarshipInteroperability for web based scholarship
Interoperability for web based scholarship
 
Summarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniquesSummarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniques
 
FAIR Signposting: A KISS Approach to a Burning Issue
FAIR Signposting: A KISS Approach to a Burning IssueFAIR Signposting: A KISS Approach to a Burning Issue
FAIR Signposting: A KISS Approach to a Burning Issue
 
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesInterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
 
PID Signposting Pattern
PID Signposting PatternPID Signposting Pattern
PID Signposting Pattern
 
A Perspective on Archiving the Scholarly Record
A Perspective on Archiving the Scholarly RecordA Perspective on Archiving the Scholarly Record
A Perspective on Archiving the Scholarly Record
 
Intro to Linked Open Data in Libraries Archives & Museums.
Intro to Linked Open Data in Libraries Archives & Museums.Intro to Linked Open Data in Libraries Archives & Museums.
Intro to Linked Open Data in Libraries Archives & Museums.
 
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive ProfilingMementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
 
Signposting for Repositories
Signposting for RepositoriesSignposting for Repositories
Signposting for Repositories
 
OAC Presentation at CNI 09 Fall Forum
OAC Presentation at CNI 09 Fall ForumOAC Presentation at CNI 09 Fall Forum
OAC Presentation at CNI 09 Fall Forum
 
How to read a million books?
How to read a million books?How to read a million books?
How to read a million books?
 
To the Rescue of the Orphans of Scholarly Communication
To the Rescue of the Orphans of Scholarly CommunicationTo the Rescue of the Orphans of Scholarly Communication
To the Rescue of the Orphans of Scholarly Communication
 
What is #LODLAM?! Understanding linked open data in libraries, archives [and ...
What is #LODLAM?! Understanding linked open data in libraries, archives [and ...What is #LODLAM?! Understanding linked open data in libraries, archives [and ...
What is #LODLAM?! Understanding linked open data in libraries, archives [and ...
 
Linked Data Basics
Linked Data BasicsLinked Data Basics
Linked Data Basics
 
What’s in a URL? Analysing COVID-19 web archive collections
What’s in a URL? Analysing COVID-19 web archive collectionsWhat’s in a URL? Analysing COVID-19 web archive collections
What’s in a URL? Analysing COVID-19 web archive collections
 

Andere mochten auch

Profiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content LanguageProfiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content LanguageMichael Nelson
 
We Need Multiple, Independent Web Archives
We Need Multiple, Independent Web ArchivesWe Need Multiple, Independent Web Archives
We Need Multiple, Independent Web ArchivesMichael Nelson
 
Evaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived PagesEvaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived PagesMichael Nelson
 
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...Michael Nelson
 
Software as a Well-Formed Research Object
Software as a Well-Formed Research ObjectSoftware as a Well-Formed Research Object
Software as a Well-Formed Research ObjectYasmin AlNoamany, PhD
 
Old Dominion University Computer Science IIPC New Member
Old Dominion University Computer Science IIPC New Member Old Dominion University Computer Science IIPC New Member
Old Dominion University Computer Science IIPC New Member Michael Nelson
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through StorytellingUsing Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through StorytellingYasmin AlNoamany, PhD
 
When Should I Make Preservation Copies of Myself?
When Should I Make Preservation Copies of Myself?�When Should I Make Preservation Copies of Myself?�
When Should I Make Preservation Copies of Myself?Michael Nelson
 
Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Evaluating the SiteStory Transactional Web Archive with the ApacheBench ToolEvaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Evaluating the SiteStory Transactional Web Archive with the ApacheBench ToolMichael Nelson
 
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web ArchivingWho Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web ArchivingMichael Nelson
 
On the Change in Archivability of Websites Over Time
On the Change in Archivability of Websites Over TimeOn the Change in Archivability of Websites Over Time
On the Change in Archivability of Websites Over TimeMichael Nelson
 
Combining Storytelling and Web Archives
Combining Storytelling and Web ArchivesCombining Storytelling and Web Archives
Combining Storytelling and Web ArchivesMichael Nelson
 
More Archives, More Better
More Archives, More Better More Archives, More Better
More Archives, More Better Michael Nelson
 
Assessing the Quality of Web Archives
Assessing the Quality of Web ArchivesAssessing the Quality of Web Archives
Assessing the Quality of Web ArchivesMichael Nelson
 
@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015Michael Nelson
 
Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionSawood Alam
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptMichael Nelson
 
Who and What Links to the Internet Archive
Who and What Links to the Internet ArchiveWho and What Links to the Internet Archive
Who and What Links to the Internet ArchiveMichael Nelson
 
Storytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesStorytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesMichael Nelson
 
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Michael Nelson
 

Andere mochten auch (20)

Profiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content LanguageProfiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content Language
 
We Need Multiple, Independent Web Archives
We Need Multiple, Independent Web ArchivesWe Need Multiple, Independent Web Archives
We Need Multiple, Independent Web Archives
 
Evaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived PagesEvaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived Pages
 
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...
 
Software as a Well-Formed Research Object
Software as a Well-Formed Research ObjectSoftware as a Well-Formed Research Object
Software as a Well-Formed Research Object
 
Old Dominion University Computer Science IIPC New Member
Old Dominion University Computer Science IIPC New Member Old Dominion University Computer Science IIPC New Member
Old Dominion University Computer Science IIPC New Member
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through StorytellingUsing Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through Storytelling
 
When Should I Make Preservation Copies of Myself?
When Should I Make Preservation Copies of Myself?�When Should I Make Preservation Copies of Myself?�
When Should I Make Preservation Copies of Myself?
 
Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Evaluating the SiteStory Transactional Web Archive with the ApacheBench ToolEvaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
 
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web ArchivingWho Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
 
On the Change in Archivability of Websites Over Time
On the Change in Archivability of Websites Over TimeOn the Change in Archivability of Websites Over Time
On the Change in Archivability of Websites Over Time
 
Combining Storytelling and Web Archives
Combining Storytelling and Web ArchivesCombining Storytelling and Web Archives
Combining Storytelling and Web Archives
 
More Archives, More Better
More Archives, More Better More Archives, More Better
More Archives, More Better
 
Assessing the Quality of Web Archives
Assessing the Quality of Web ArchivesAssessing the Quality of Web Archives
Assessing the Quality of Web Archives
 
@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015
 
Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief Introduction
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
 
Who and What Links to the Internet Archive
Who and What Links to the Internet ArchiveWho and What Links to the Internet Archive
Who and What Links to the Internet Archive
 
Storytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesStorytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web Archives
 
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
 

Ähnlich wie Profiling Web Archives

Web Archiving Profile - WADL 2013
Web Archiving Profile - WADL 2013Web Archiving Profile - WADL 2013
Web Archiving Profile - WADL 2013Ahmed AlSum
 
ISSA: Generic Pipeline, Knowledge Model and Visualization tools to Help Scien...
ISSA: Generic Pipeline, Knowledge Model and Visualization tools to Help Scien...ISSA: Generic Pipeline, Knowledge Model and Visualization tools to Help Scien...
ISSA: Generic Pipeline, Knowledge Model and Visualization tools to Help Scien...Franck Michel
 
NHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-LifeNHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-LifeEdward Baker
 
NHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-LifeNHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-LifeVince Smith
 
Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Roxanne Missingham
 
Embracing Diversity: Searching over Multiple Languages - Suneel Marthi, Red H...
Embracing Diversity: Searching over Multiple Languages - Suneel Marthi, Red H...Embracing Diversity: Searching over Multiple Languages - Suneel Marthi, Red H...
Embracing Diversity: Searching over Multiple Languages - Suneel Marthi, Red H...Lucidworks
 
IIIF at europeana, IIIF conference, Vatican, 2017
IIIF at europeana, IIIF conference, Vatican, 2017IIIF at europeana, IIIF conference, Vatican, 2017
IIIF at europeana, IIIF conference, Vatican, 2017Nuno Freire
 
Web Archiving – Lessons and Potential
 Web Archiving – Lessons and Potential Web Archiving – Lessons and Potential
Web Archiving – Lessons and PotentialDaniel Gomes
 
Metadata Aggregation: Assessing the Application of IIIF and Sitemaps within C...
Metadata Aggregation: Assessing the Application of IIIF and Sitemaps within C...Metadata Aggregation: Assessing the Application of IIIF and Sitemaps within C...
Metadata Aggregation: Assessing the Application of IIIF and Sitemaps within C...Nuno Freire
 
Arcomem training Specifying Crawls Advanced
Arcomem training Specifying Crawls AdvancedArcomem training Specifying Crawls Advanced
Arcomem training Specifying Crawls Advancedarcomem
 
Arcomem training Specifying Crawls Beginners
Arcomem training Specifying Crawls BeginnersArcomem training Specifying Crawls Beginners
Arcomem training Specifying Crawls Beginnersarcomem
 
Finalrevc
FinalrevcFinalrevc
FinalrevcSUNCAT
 
Cosi Opac Tweaks
Cosi   Opac TweaksCosi   Opac Tweaks
Cosi Opac Tweaksdaveyp
 
Snrg2011 6.15.2.sta canney_suranofsky
Snrg2011 6.15.2.sta canney_suranofskySnrg2011 6.15.2.sta canney_suranofsky
Snrg2011 6.15.2.sta canney_suranofskykaran saini
 
Named Entity Recognition for Europeana Newspapers
Named Entity Recognition for Europeana NewspapersNamed Entity Recognition for Europeana Newspapers
Named Entity Recognition for Europeana Newspaperscneudecker
 
Digital Preservation Best Practices: Lessons Learned From Across the Pond
Digital Preservation Best Practices: Lessons Learned From Across the PondDigital Preservation Best Practices: Lessons Learned From Across the Pond
Digital Preservation Best Practices: Lessons Learned From Across the PondBenoit Pauwels
 
Digital Presentation Best Practices: Lessons Learned From Across the Pond
Digital Presentation Best Practices: Lessons Learned From Across the PondDigital Presentation Best Practices: Lessons Learned From Across the Pond
Digital Presentation Best Practices: Lessons Learned From Across the PondULB - Bibliothèques
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunitiesAhmed AlSum
 

Ähnlich wie Profiling Web Archives (20)

Internet content as research data
Internet content as research dataInternet content as research data
Internet content as research data
 
Web Archiving Profile - WADL 2013
Web Archiving Profile - WADL 2013Web Archiving Profile - WADL 2013
Web Archiving Profile - WADL 2013
 
ISSA: Generic Pipeline, Knowledge Model and Visualization tools to Help Scien...
ISSA: Generic Pipeline, Knowledge Model and Visualization tools to Help Scien...ISSA: Generic Pipeline, Knowledge Model and Visualization tools to Help Scien...
ISSA: Generic Pipeline, Knowledge Model and Visualization tools to Help Scien...
 
Profiling Web Archives

  • 1. Profiling Web Archives
      Michael L. Nelson, Ahmed AlSum, Michele C. Weigle, Herbert Van de Sompel, David Rosenthal
      IIPC General Assembly, Paris, France, May 21, 2014
  • 6. Where's that issue with the Afghan girl?
  • 10. Prior IIPC Memento Aggregator Project
      • Ten IIPC archives, led by LANL
      • Conceived at 2011 IIPC meeting
      • Results reported at 2012 IIPC meeting
          o http://netpreserve.org/sites/default/files/resources/Sanderson.pdf
      • Two highlights:
  • 13. Stop and Rethink…
      • LANL's processing was informative from a "big data" perspective, but was neither scalable nor sustainable
          o "send us your CDX" == hard for both parties
          o there are lots of URIs in the world
      • Will only get worse with:
          o more archives…
          o …doing more archiving
  • 14. Leverage Memento Aggregators
      • Memento aggregators currently broadcast URI lookups to all known archives
      • New approach:
          1. Build profiles based on sampling from URI lookups (optionally supplement with CDX files when available)
          2. Use archive profiles to inform Memento aggregator "query routing" decisions
          3. Share serialized profiles with other IIPC partners
      http://mementoproxy.lanl.gov/aggr/timemap/link/1/http://www.bnf.fr/
  • 15. Profiling Studies
      • TPDL 2013
          o 12 archives, March 2013, public web archives used but techniques apply generally
          o sampling only, no CDX access
      • IJDL 2014 (to appear)
          o 15 archives (+4, -1), October 2013
          o slightly larger sample URI dataset
          o results similar
  • 16. URI Lookup = Limited Information
      GET /aggr/timegate/http://www.bnf.fr/ HTTP/1.1
      Host: mementoproxy.lanl.gov
      Accept-Datetime: Sun, 29 May 2005 02:46:53 GMT
      Accept-Language: fr; q=1.0, en; q=0.5
      …
      1. Original URI
      2. Memento-Datetime
      3. Preferred URI
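The lookup on slide 16 can be sketched with the Python standard library. This is a minimal illustration, not the aggregator's own client: it only builds the URL and headers, it does not send anything, and the function name `build_timegate_lookup` is hypothetical. The TimeGate prefix and header values are taken from the slide.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# TimeGate prefix assembled from the slide's request line and Host header.
TIMEGATE_PREFIX = "http://mementoproxy.lanl.gov/aggr/timegate/"


def build_timegate_lookup(original_uri, when, languages="fr; q=1.0, en; q=0.5"):
    """Return (url, headers) for a TimeGate lookup of original_uri near `when`.

    `when` must be a timezone-aware datetime; it is converted to GMT because
    Accept-Datetime uses the HTTP date format.
    """
    headers = {
        "Accept-Datetime": format_datetime(when.astimezone(timezone.utc), usegmt=True),
        "Accept-Language": languages,
    }
    return TIMEGATE_PREFIX + original_uri, headers


url, headers = build_timegate_lookup(
    "http://www.bnf.fr/",
    datetime(2005, 5, 29, 2, 46, 53, tzinfo=timezone.utc),
)
```

The three pieces of information the slide lists are all an archive sees per lookup, which is why profiles have to be built up from many sampled lookups.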
  • 17. Where to find Mementos for … http://www.japantimes.co.jp/
  • 18. Where to find Mementos for … http://www.japantimes.co.jp/
  • 19. Where to find Mementos for … http://www.bnf.fr
  • 20. Where to find Mementos for … http://www.bnf.fr
  • 21. Research Question
      Problem
      • Profile public web archives according to the following dimensions:
          o Top-level domains
          o Languages
          o Growth rate
          o Archival date
      Motivation
      • Determine who is archiving what
      • Optimize query routing for a Memento Aggregator
  • 22. Web Archives in this Experiment
      Archive                        Full text   URI-lookup
      Internet Archive                           √
      Library of Congress                        √
      Icelandic Web Archive                      √
      Library and Archives Canada    √           √
      British Library                √           √
      UK National Library            √           √
      Portuguese Web Archive         √           √
      Web Archive of Catalonia       √           √
      Croatian Web Archive           √           √
      Archive of the Czech Web       √           √
      National Taiwan University     √           √
      Archive It                     √           √
  • 23. Experiment Set Up
      • Sample URIs from seven different sources
      • Retrieve the TimeMap for each URI from all archives
          o A TimeMap lists all Mementos for a given URI
          o A Memento is an archived version of a resource
      • Analyze who has holdings for which URIs
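As a rough sketch of the "who has holdings" step, the snippet below extracts Mementos from a link-format TimeMap. The sample TimeMap, the host `archive.example`, and the function name `mementos_in_timemap` are made up for illustration. The regular expressions handle only the simple quoted-parameter form shown here, not every variation the link format allows.

```python
import re

# One link-value: <URI> followed by its ;-separated quoted parameters.
LINK_VALUE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM = re.compile(r'([\w-]+)="([^"]*)"')


def mementos_in_timemap(timemap_text):
    """Return (memento URI, datetime) pairs found in a link-format TimeMap."""
    found = []
    for uri, raw_params in LINK_VALUE.findall(timemap_text):
        params = dict(PARAM.findall(raw_params))
        # rel may hold several space-separated values, e.g. "first memento".
        if "memento" in params.get("rel", "").split():
            found.append((uri, params.get("datetime")))
    return found


# Illustrative TimeMap; the archive host and capture dates are invented.
SAMPLE_TIMEMAP = """\
<http://www.bnf.fr/>; rel="original",
<http://archive.example/timemap/link/http://www.bnf.fr/>; rel="self"; type="application/link-format",
<http://archive.example/20050529024653/http://www.bnf.fr/>; rel="first memento"; datetime="Sun, 29 May 2005 02:46:53 GMT",
<http://archive.example/20130301000000/http://www.bnf.fr/>; rel="last memento"; datetime="Fri, 01 Mar 2013 00:00:00 GMT"
"""

# Holdings per archive for one sampled URI; an empty list means no holdings.
holdings = {"archive.example": mementos_in_timemap(SAMPLE_TIMEMAP)}
```

Repeating this for every sampled URI and every archive yields the per-archive counts that the coverage slides summarize.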
  • 24. Sampling URIs - DMOZ
      1. DMOZ:Random
          o 10,000 URIs randomly sampled from the DMOZ directory (~5M URIs)
      2. DMOZ:TLD - 2% for each TLD from DMOZ or 100 URIs, whichever is greater
          o 52 TLDs: (com 23,470), (de 6,332), (org 4,025), (uk 3,309), (net 2,073), (it 1,775), (jp 1,379), (ru 1,244), (fr 1,154), (pl 1,062), (au 764), (ca 642), (at 438), (edu 390), (cz 385), (tr 334), (info 319), (cn 278), (us 266), (nz 265), (es 238), (ar 213), (no 150), (br 149), (tw 141), (za 118), (fi 113), and 100 URIs each for [ae, cat, cl, cu, eg, gov, id, in, ir, is, ke, kr, ma, mt, mx, my, na, pe, pk, pt, sa, to, uy, zw]
      3. DMOZ:Languages - 100 URIs for each language
          o 24 languages: Icelandic, Portuguese, Catalan, Afrikaans, Arabic, Indonesian, Chinese (Simplified), Chinese (Traditional), Dutch, Spanish, French, Greek, Hindi, Italian, Japanese, Korean, Norwegian, Persian, Polish, Russian, Turkish, Ukrainian
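The DMOZ:TLD rule ("2% for each TLD or 100 URIs, whichever is greater") might look like the sketch below. The function names and the fixed seed are hypothetical. Capping the sample at the group size when a TLD has fewer than 100 URIs is an assumption, since the slide does not say how that case was handled.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def tld_of(uri):
    """Return the last label of the URI's host name, e.g. 'fr' for www.bnf.fr."""
    host = urlsplit(uri).hostname or ""
    return host.rsplit(".", 1)[-1]


def sample_by_tld(uris, fraction=0.02, minimum=100, seed=0):
    """Sample max(minimum, fraction of the group) URIs from each TLD group."""
    by_tld = defaultdict(list)
    for uri in uris:
        by_tld[tld_of(uri)].append(uri)

    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = {}
    for tld, group in by_tld.items():
        wanted = max(minimum, round(fraction * len(group)))
        # Assumption: a TLD with fewer URIs than `wanted` contributes all of them.
        sample[tld] = rng.sample(group, min(wanted, len(group)))
    return sample


# Synthetic directory, for illustration only.
directory = (
    [f"http://site{i}.example.com/" for i in range(10_000)]
    + [f"http://site{i}.example.is/" for i in range(500)]
    + [f"http://site{i}.example.to/" for i in range(40)]
)
sizes = {tld: len(picked) for tld, picked in sample_by_tld(directory).items()}
```

With this rule, large TLDs such as com are sampled proportionally while small TLDs still get enough URIs to say something about national archives.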
  • 25. Sampling URIs – Web Archives Full Text
      • Query the full-text search interface of select web archives with two sets of query terms.
      4. Top 1-Gram from Bing
          o Most are English
      5. Top 1000 query terms from Yahoo in 9 languages
          o Excluding general keywords such as: Obama, Facebook.
  • 26. Sampling URIs – Web Archives Full Text
  • 27. Sampling URIs – Web Archives Full Text
  • 28. Sampling URIs – User Requests
      • Sampling from user requests for archived web resources
      6. Sample from IA Wayback Machine log files
          o 1,000 URIs randomly sampled from Feb 22, 2012 to Feb 26, 2012.
      7. Sample from Memento Aggregator log files
          o 100 URIs randomly sampled from the LANL Memento Aggregator between 2011 and 2013.
  • 29. Archive Coverage per Sample (figure: Entire Sample; callouts: 100%, 35%)
  • 30. TLD Coverage across Archives (1) (figure: Entire Sample)
  • 31. TLD Coverage across Archives (2) (figure: Entire Sample)
  • 32. TLD Distribution per Archive (figure: DMOZ:TLD Sample)
  • 33. TLD Distribution per Archive (figure: Web Archives Full Text Sample)
  • 34. Language Coverage per Archive (figure: DMOZ Sample)
  • 35. Archive Growth Rate (figure: Entire Sample)
  • 36. Query Routing Evaluation
  • 37. Study Results
      • Introduced sampling to profile web archives using available infrastructure, no privileged access
      • Coverage:
          o Internet Archive provides broad coverage
          o National archives have good coverage for their domains
          o Surprising coverage by certain archives
      • Query Routing:
          o In 84% of the cases, all existing Mementos for a TLD can be found by using IA and two additional top archives for a TLD
          o In 55% of the cases, all existing Mementos for a TLD can be found by using the top 3 archives for a TLD, excluding IA
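The routing result ("IA and two additional top archives for a TLD") suggests a selection rule along these lines. This is a sketch only: the function name `route`, the archive codes other than "TW", and every coverage number are hypothetical, and the slides do not specify how an aggregator would actually rank archives.

```python
def route(tld, coverage, always=("IA",), extra=2):
    """Pick the archives to query for a URI in `tld`.

    `coverage` maps archive code -> {tld: score}. Archives in `always` are
    queried unconditionally; the `extra` best-scoring other archives with a
    non-zero score for the TLD are added after them.
    """
    candidates = [
        code
        for code, scores in coverage.items()
        if code not in always and scores.get(tld, 0) > 0
    ]
    candidates.sort(key=lambda code: coverage[code][tld], reverse=True)
    return list(always) + candidates[:extra]


# Made-up coverage scores, for illustration only.
COVERAGE = {
    "IA": {"tw": 0.90, "fr": 0.90, "pt": 0.90},
    "TW": {"tw": 0.60, "jp": 0.04},
    "UK": {"uk": 0.80, "tw": 0.02},
    "PT": {"pt": 0.70, "tw": 0.01},
    "CAT": {"cat": 0.50},
}

targets = route("tw", COVERAGE)
```

Compared with broadcasting to every known archive, this sends three requests for a .tw lookup instead of five in the toy data above.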
  • 38. Next Steps With the IIPC
      • Finding the right granularity
          o too fine: http://www.bnf.fr/fr/evenements_et_culture/a.passe_bnf.html
          o too coarse: .fr
          o just right?: bnf.fr, www.bnf.fr, gallica.bnf.fr, www.bnf.fr/fr/
      • Generating profiles
          o what are desirable / representative sample sets: domains, languages, regions, etc. -- what's missing?
          o local CDX analysis tools (can help with cold start problem)
      • Profile format
          o community input (yet another metadata format)
          o github (or other tools) for exchange & integration
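To make the granularity question concrete, the sketch below derives candidate profile keys from one URI, ordered from coarse to fine. The function name `profile_keys` is hypothetical. Taking the last two host labels as the domain is a simplifying assumption that is wrong for suffixes such as co.uk, where a public-suffix list would be needed.

```python
from urllib.parse import urlsplit


def profile_keys(uri):
    """Return candidate profile keys for `uri`, from coarsest to finest."""
    parts = urlsplit(uri)
    host = parts.hostname or ""
    labels = host.split(".")
    segments = [segment for segment in parts.path.split("/") if segment]

    keys = [
        "." + labels[-1],        # TLD, e.g. ".fr"
        ".".join(labels[-2:]),   # naive domain, e.g. "bnf.fr"
        host,                    # full host, e.g. "www.bnf.fr"
    ]
    if len(segments) > 1:
        # First path segment, kept only when it is a directory with content below it.
        keys.append(f"{host}/{segments[0]}/")
    # Drop duplicates (e.g. for two-label hosts) while keeping the order.
    return list(dict.fromkeys(keys))


keys = profile_keys("http://www.bnf.fr/fr/evenements_et_culture/a.passe_bnf.html")
```

A profile could then count holdings under each key and keep whichever level turns out to discriminate best between archives.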
  • 39. A Possible Serialization
      {"Profile":{
        "Name":"Taiwan Web Archive",
        "URI":"http://webarchive.lib.ntu.edu.tw",
        "TimeGate": "http://mementoproxy.cs.odu.edu/tw/timegate/",
        "Code":"TW",
        "Age":"Tue, 15 Jul 1997 00:00:00 GMT",
        "TLD":[{"tw":0.6},{"cn":0.08},{"hk":0.04},
               {"eg":0.04},{"gov":0.04},{"my":0.04},
               {"jp":0.04},{"kr":0.02}],
        "Language":[{"zh-TW":0.5},{"zh-CN":0.25},
                    {"id":0.08},{"ar":0.08}],
        "GrowthRate":[
          {"199707":[4,4]},{"200202":[1,1]},
          {"200607":[30,62]},{"200608":[20,80]},
          {"200609":[5,9]},{"200612":[77,129]},
          ... // other values truncated
          {"201308":[7,94]},{"201309":[2,94]}]
        }
      }
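A consumer of such a profile could read it as sketched below. The embedded JSON repeats only the Name, Code, TLD and Language fields from the slide. GrowthRate is left out because the slide truncates it and the truncation marker is not valid JSON. The helper name `flatten` is hypothetical.

```python
import json

# Subset of the slide's profile; GrowthRate omitted because it is truncated there.
PROFILE_JSON = """
{"Profile": {
  "Name": "Taiwan Web Archive",
  "Code": "TW",
  "TLD": [{"tw": 0.6}, {"cn": 0.08}, {"hk": 0.04}, {"eg": 0.04},
          {"gov": 0.04}, {"my": 0.04}, {"jp": 0.04}, {"kr": 0.02}],
  "Language": [{"zh-TW": 0.5}, {"zh-CN": 0.25}, {"id": 0.08}, {"ar": 0.08}]
}}
"""


def flatten(pairs):
    """Merge a list of single-key objects, as used for TLD and Language."""
    merged = {}
    for pair in pairs:
        merged.update(pair)
    return merged


profile = json.loads(PROFILE_JSON)["Profile"]
tld_weights = flatten(profile["TLD"])
language_weights = flatten(profile["Language"])

# The TLD with the largest share of this archive's sampled holdings.
best_tld = max(tld_weights, key=tld_weights.get)
```

The list-of-single-key-objects layout preserves the slide's ordering, at the cost of this extra flattening step before lookups.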
  • 42. {Light, Dim, Dark} Archives
      • Work to date has assumed light archives because our focus has been on sampling archives we don't control
      • Applicable to a continuum of archives:
          o download/fork and run "dark-sample.py"
          o it accesses sample URIs from the IIPC github
          o it issues URI lookups to the local archive
          o write/update your archive profile in the IIPC github with machine-readable IP restrictions
          o all profiles -- light/dim/dark -- are then available to Memento aggregators and other IIPC analysis tools
  • 43. Profiles = Easy Discovery, Sharing
      http://netpreserve.org/aggr/timemap/link/1/http://www.bnf.fr/