Temporal Anchor Text as Proxy for User Queries
Thaer Samar, Arjen P. de Vries
Web Archiving 1/2
• The Web is a major source of published information
• Content on the Web evolves and changes continuously
• Many initiatives aim to archive the Web
  - Petabytes of archived data
Web Archiving 2/2
• Web archives are incomplete
  - It is impossible to include all Web pages due to crawling limitations, e.g., [Masanès06]
    - Depth-first crawl: focuses only on selected websites
    - Breadth-first crawl: focuses on the entire domain, but not in depth
Reconstruct Queries
• Our study: the evolution of anchor text over time, used to reconstruct what was important in the past
  - Information that would be similar to user queries
• Inspiration:
  - Document titles can be used as an approximation of user queries [Jin et al.]
  - Anchor text exhibits characteristics similar to user queries and document titles [Eiron & McCurley]
Queries in the Past
• User queries have usually not been preserved
  - Impossible to reconstruct which queries users would have used to search the archive
• However, web archives contain more than the Web page content
  - E.g., page source, different timestamps (archive date, last-modified date), link structure
Link Evidence and Anchor Text
• Link information consists of the source URL, the destination URL, and the anchor text
• Anchor text is a short text describing the destination page
  - Has been shown to improve search effectiveness in a large number of Information Retrieval studies
Example link: source http://www.cwi.nl, destination http://www.nwo.nl, anchor text ‘NWO’
Data: Dutch Web Archive
• National Library of the Netherlands (KB)
  - Depth-first (selective) Web archive
  - Since 2007
  - 10+ TB
  - 8,000+ websites
• Our snapshot
  - 2009-2012
Link Processing
Filtering
• Pages of type text/html
• ~70% of archived objects
Web Archive Record:
URL: http://www.cwi.nl
Archive-Date: 20091201
Content-Type: text/html
<html>
<a href=http://www.nwo.nl> NWO </a>
</html>
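The record above can be turned into a link tuple roughly as follows. This is a minimal sketch using Python's standard html.parser, not the authors' actual pipeline; the record dictionary layout and the names AnchorExtractor and extract_links are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorExtractor(HTMLParser):
    """Collect (destination URL, anchor text) pairs from one HTML page."""

    def __init__(self, source_url):
        super().__init__()
        self.source_url = source_url
        self.links = []      # finished (destination, anchor text) pairs
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the source page URL.
                self._href = urljoin(self.source_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


def extract_links(record):
    """Return (source, destination, anchor, YYYYMM) tuples for a text/html record."""
    if not record["content_type"].startswith("text/html"):
        return []  # filtering step: only HTML pages are processed
    parser = AnchorExtractor(record["url"])
    parser.feed(record["body"])
    parser.close()
    month = record["archive_date"][:6]  # 20091201 -> 200912
    return [(record["url"], dest, text, month) for dest, text in parser.links]


# Hypothetical in-memory form of the record shown on the slide.
record = {
    "url": "http://www.cwi.nl",
    "archive_date": "20091201",
    "content_type": "text/html",
    "body": "<html><a href=http://www.nwo.nl> NWO </a></html>",
}
links = extract_links(record)
```

For the slide's record this yields one tuple with source http://www.cwi.nl, destination http://www.nwo.nl, anchor text NWO, and archive month 200912.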
Link Processing
Filtering
• Pages of type text/html
• ~70% of archived objects
Extraction
• Source URL
• Destination URL
• Anchor text
• Crawl-date (YYYYMM)
Cleaning
• URL normalization; get the host of the source and the destination
• Remove spam, e.g., rolex watches
Partitioning
• Based on one-year and one-month granularity
Deduplication
• Remove duplicate links caused by crawling frequency
  - Same source, destination, and anchor text
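The cleaning, partitioning, and deduplication steps could look roughly like the sketch below. It is written under stated assumptions: the spam list is a hypothetical one-word blacklist (the slides only give "rolex watches" as an example), host normalization is reduced to lowercasing, and duplicates are removed within each partition. The names host_of, is_spam, and partition_links are illustrative.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# ASSUMPTION: a tiny illustrative blacklist; the real spam filter is not described.
SPAM_TERMS = {"rolex"}


def host_of(url):
    """Return the lowercase host of a URL, or an empty string if it has none."""
    return urlsplit(url).hostname or ""


def is_spam(anchor):
    """True if any blacklisted term occurs as a word in the anchor text."""
    words = anchor.lower().split()
    return any(term in words for term in SPAM_TERMS)


def partition_links(links, granularity="year"):
    """Group links by year (YYYY) or month (YYYYMM).

    `links` holds (source_url, destination_url, anchor, yyyymm) tuples.
    Each partition is a set, so a repeated (source, destination, anchor)
    triple is kept only once per partition.
    """
    width = 4 if granularity == "year" else 6
    partitions = defaultdict(set)
    for source, dest, anchor, yyyymm in links:
        if not anchor or is_spam(anchor):
            continue  # cleaning: drop empty and spam anchors
        partitions[yyyymm[:width]].add((source, dest, anchor))
    return dict(partitions)


links = [
    ("http://www.cwi.nl", "http://www.nwo.nl", "NWO", "200912"),
    ("http://www.cwi.nl", "http://www.nwo.nl", "NWO", "200912"),  # re-crawled duplicate
    ("http://www.cwi.nl", "http://www.nwo.nl", "NWO", "201001"),
    ("http://a.nl", "http://b.nl", "cheap rolex watches", "200912"),
]
by_year = partition_links(links, "year")
by_month = partition_links(links, "month")
```

Using a set per partition makes deduplication a side effect of insertion, so repeated crawls of the same page within one period do not inflate link counts.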
Hosts Evolution
• Important hosts over time
• Aggregate links based on the target host
  - Keep unique source hosts
  - Multiple pages from the same host linking to the same target host are counted as one
• Rank hosts based on the number of source hosts linking to them
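As a rough illustration of this ranking, the sketch below counts distinct source hosts per target host within one time partition. The function name rank_hosts and the input layout (pairs of source and destination URLs) are assumptions; ties are broken alphabetically only to make the output deterministic.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def rank_hosts(links):
    """Rank target hosts by the number of distinct source hosts linking to them.

    `links` is an iterable of (source_url, destination_url) pairs taken from
    one time partition.
    """
    sources_per_target = defaultdict(set)
    for source_url, dest_url in links:
        source_host = urlsplit(source_url).hostname
        target_host = urlsplit(dest_url).hostname
        if source_host and target_host:
            # A set keeps each source host once, however many of its pages link.
            sources_per_target[target_host].add(source_host)
    counts = {host: len(srcs) for host, srcs in sources_per_target.items()}
    # Highest count first; alphabetical order among equal counts.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


links = [
    ("http://a.nl/page1", "http://x.nl"),
    ("http://a.nl/page2", "http://x.nl/about"),  # same source host, counted once
    ("http://b.nl", "http://x.nl"),
    ("http://a.nl", "http://y.nl"),
]
ranking = rank_hosts(links)
```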
% of new hosts over the years
[Figure: percentage of new hosts per year; e.g., new hosts in 2012 are those that do not appear in 2009, 2010, or 2011]
Anchor Text Evolution
• Measure the importance of anchor text a over time in time-partitioned links
  - Aggregate by anchor text
  - Compute the archive-based popularity
  - Normalize by the maximum
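The slides do not define the archive-based popularity precisely. The sketch below therefore assumes one plausible reading: the raw popularity of an anchor text in a partition is the number of distinct source hosts using it, and scores are divided by the largest raw value in that partition. The function name anchor_popularity and the input layout are illustrative.

```python
from collections import defaultdict


def anchor_popularity(links):
    """Return {anchor text: score in (0, 1]} for one time partition.

    `links` is an iterable of (source_host, anchor_text) pairs.
    ASSUMPTION: raw popularity = number of distinct source hosts that use
    the anchor text; the slides do not spell out the exact definition.
    """
    hosts_per_anchor = defaultdict(set)
    for source_host, anchor in links:
        hosts_per_anchor[anchor].add(source_host)
    if not hosts_per_anchor:
        return {}
    # Normalize by the maximum raw popularity in this partition.
    top = max(len(hosts) for hosts in hosts_per_anchor.values())
    return {a: len(hosts) / top for a, hosts in hosts_per_anchor.items()}


links = [
    ("a.nl", "amsterdam"),
    ("b.nl", "amsterdam"),
    ("a.nl", "amsterdam"),  # same host again, not counted twice
    ("a.nl", "nwo"),
]
scores = anchor_popularity(links)
```

Normalizing within each partition keeps scores comparable across years even when the number of crawled pages differs from one year to the next.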
% new anchor text over the years
• Anchor text is new in a specific partition if it does not appear in the previous partitions
• Based on one-year granularity
  - 59% new anchor text
• Based on one-month granularity
  - 34% new anchor text
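The definition of "new" above translates directly into a running set of everything seen so far. The sketch below is a minimal illustration with made-up data; percent_new is a hypothetical name, and how the per-partition values are combined into the reported 59% and 34% is not stated in the slides, so no aggregation is attempted.

```python
def percent_new(partitions):
    """For each partition after the first, return the percentage of its items
    that did not appear in any earlier partition.

    `partitions` maps a sortable time key (e.g. "2009" or "200912") to a set
    of anchor texts. The same function works for sets of hosts.
    """
    seen = set()
    result = {}
    for index, key in enumerate(sorted(partitions)):
        items = partitions[key]
        if index > 0 and items:
            result[key] = 100.0 * len(items - seen) / len(items)
        seen |= items
    return result


# Toy data, not taken from the archive.
partitions = {
    "2009": {"nwo", "cwi"},
    "2010": {"nwo", "kb"},
    "2011": {"kb", "cwi", "twitter", "flickr"},
}
new_share = percent_new(partitions)
```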
WikiStats
• Aggregated view counts of Wikipedia (WP) pages
  - From Jan 2008 to Jan 2015
• We focus on
  - Feb 2009 to Dec 2012
  - Similar to the period of our snapshot of the Dutch Web archive
• Keep WP titles viewed >= 1,000 times
Matching anchor text to WP titles
• Pre-process WP titles in the same way as the anchor text
  - Lowercasing
  - Stop-word removal
• One-year and one-month granularity partitions
• Collect titles by exact match with the anchors
  - Assume anchor popularity equals WP page popularity
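The matching procedure can be sketched as below for a single partition. The stop-word list is a tiny hypothetical one, replacing underscores with spaces is an assumption about how WP titles are stored, and keeping the highest view count when two titles normalize to the same string is a design choice of this sketch rather than something the slides specify.

```python
# ASSUMPTION: a tiny illustrative stop-word list; the real list is not given.
STOP_WORDS = {"de", "het", "een", "van", "the", "of"}


def normalize(text):
    """Lowercase, treat underscores as spaces, and drop stop words."""
    tokens = text.lower().replace("_", " ").split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def match_titles(anchors, wp_views, min_views=1000):
    """Return {anchor: views} for anchors whose normalized form exactly
    equals the normalized form of a sufficiently viewed WP title.

    `wp_views` maps a WP title to its view count in the partition.
    """
    titles = {}
    for title, views in wp_views.items():
        key = normalize(title)
        if views >= min_views and key:
            titles[key] = max(views, titles.get(key, 0))
    matches = {}
    for anchor in anchors:
        key = normalize(anchor)
        if key in titles:
            matches[anchor] = titles[key]
    return matches


# Toy data, not taken from WikiStats.
wp_views = {"Amsterdam": 50000, "De_Volkskrant": 2000, "Filmpje!": 5000, "Obscure": 10}
anchors = ["amsterdam", "de Volkskrant", "filmpje", "obscure"]
matched = match_titles(anchors, wp_views)
```

In this toy run "filmpje" stays unmatched because the title carries a trailing exclamation mark, and "obscure" is dropped by the view threshold.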
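Ranked anchor text with WP match
• Different rank cut-offs
[Figure: % overlap between ranked anchor text and WP titles per rank cut-off. The overlap decreases as the cut-off increases; ~56% of the top-1k anchor texts have a match]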
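The overlap at a rank cut-off is simply the share of the top-k anchor texts that have a WP match. A minimal sketch with toy data follows; overlap_at_cutoffs and its default cut-offs are hypothetical.

```python
def overlap_at_cutoffs(ranked_anchors, matched, cutoffs=(10, 100, 1000)):
    """Return {k: percentage of the top-k anchors that have a WP match}.

    `ranked_anchors` is a list ordered from most to least popular and
    `matched` is the set of anchors for which a WP title was found.
    """
    result = {}
    for k in cutoffs:
        top = ranked_anchors[:k]
        if top:
            hits = sum(1 for anchor in top if anchor in matched)
            result[k] = 100.0 * hits / len(top)
    return result


# Toy data, not taken from the archive.
ranked = ["amsterdam", "twitter", "xyz", "nwo"]
matched = {"amsterdam", "twitter"}
overlap = overlap_at_cutoffs(ranked, matched, cutoffs=(2, 4))
```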
Examples of popular anchor text (with match)
• Major cities in the Netherlands
  - E.g., Amsterdam, Rotterdam, Groningen, and Utrecht
• Social web sites
  - E.g., twitter, linkedin, flickr, and vimeo
• Major Dutch daily newspapers
  - E.g., de Volkskrant, Telegraaf, and Trouw
• Dutch public broadcasting
  - E.g., uitzending gemist
• Government web services
  - E.g., belastingdienst
Discussion
• Our original goal was to identify historically trending events from the link evolution recorded in the archive
• Unfortunately, we found only a few examples with our current analysis
  - E.g., ‘canon’ *
• However, important anchor text provides an overview of important Dutch entities
* Corresponding to an activity initiated by the government to define the canonical historic events in Dutch history
Limitations & Future Work
• Exact text matching between anchor text and WP title
  - E.g., filmpje does not match WP title filmpje!
• Additional pre-processing
  - Stemming, stopping, and generalizing from exact match to match with low edit distance
• Our analysis is based on a depth-first crawl of a few thousand Dutch websites
  - Breadth-first crawl such as [CommonCrawl]
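To illustrate the proposed generalization to low edit distance, the sketch below uses a plain Levenshtein distance and accepts titles within a small threshold. This is one possible realization of the idea, not something the authors report having implemented; fuzzy_match and the threshold of 1 are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # delete char_a
                curr[j - 1] + 1,                   # insert char_b
                prev[j - 1] + (char_a != char_b),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def fuzzy_match(anchor, titles, max_distance=1):
    """Return the titles within `max_distance` edits of the anchor text."""
    return [t for t in titles if edit_distance(anchor, t) <= max_distance]


candidates = fuzzy_match("filmpje", ["filmpje!", "film"])
```

With a threshold of 1, the anchor filmpje now matches the title filmpje!, which exact matching misses. Comparing every anchor with every title is quadratic, so a real run over the full archive would likely need blocking or an index.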
References
• [Masanès06] J. Masanès. Web Archiving. Springer, 2006.
• [Jin et al.] Rong Jin, Alexander G. Hauptmann, and ChengXiang Zhai. Title language model for information retrieval. In SIGIR 2002.
• [Eiron & McCurley] Nadav Eiron and Kevin S. McCurley. Analysis of anchor text for web search. In SIGIR 2003.
• [CommonCrawl] https://commoncrawl.org/
• [WikiStats] http://wikistats.ins.cwi.nl/