Web archives preserve the fast-changing web. While web pages can be archived, the popularity of past queries has usually not been preserved. Previous studies have observed the importance of anchor text for improving the quality of text search, and have shown that anchor text is similar to real user queries and document titles. Other studies have shown that document titles are similar to real user queries. In this paper, we propose an approach that uses temporal anchor text to reconstruct the information a past query log would have provided. First, we study the link graph of four years of a Web archive to show how the target hosts and anchor text evolve over time. Second, we investigate the importance of anchor text over time. Our approach is to rank anchor texts by their popularity in the archive at a specific time. We then check the importance of the top-ranked anchor texts on the public Web at the same time. To do this, we use the WikiStats dataset, which aggregates page views of Wikipedia pages. Using exact string matching between top-ranked anchor text and Wikipedia titles in the WikiStats dataset, we find a high percentage of overlap (approximately 57%). Our data strengthens the hypothesis that anchor text may be used as a proxy for actual query volume.
2. Web Archiving 1/2
The Web is a major source of published
information
Content on the Web evolves and changes
continuously
Many initiatives aim to archive the Web
Petabytes of archived data
3. Web Archiving 2/2
Web archives are incomplete
Impossible to include all Web pages due to
crawling limitations e.g., [Masanès06]
Depth-first crawl, focus only on selected web sites
Breadth-first crawl, focus on the entire domain,
but not in depth
4. Reconstruct Queries
Our study: evolution of anchor text over time
to reconstruct what was important in the past
Information that would be similar to user queries
Inspiration:
Document titles can be used as an approximation
of user queries [Jin et al.]
Anchor text exhibits characteristics similar to user
query and document title [Eiron & McCurley]
5. Queries in the Past
User queries have usually not been preserved
Impossible to reconstruct which queries the
user would have used to search the archive
However, web archives contain more than the
Web page content
E.g., page source, different timestamps (archive
date, last-modified date), link structure
6. Link Evidence and Anchor Text
Link information represents the source URL,
destination URL, and the anchor text
Anchor text is a short text describing the
destination page
Has been shown to improve search effectiveness in a
large number of Information Retrieval studies
Example link:
Source: http://www.cwi.nl
Destination: http://www.nwo.nl
Anchor text: ‘NWO’
7. Data: Dutch Web Archive
National Library of the Netherlands (KB)
Depth-first (selective) Web archive
Since 2007
10+ TB
8,000+ websites
Our snapshot
2009-2012
12. Link Processing
Filtering text/html pages
~70% of archived objects
Extraction
Source URL
Destination URL
Anchor text
Archive-date (YYYYMM)
Web Archive Record
URL: http://www.cwi.nl
Archive-Date: 20091201
Content-Type: text/html
<html>
<a href="http://www.nwo.nl">NWO</a>
</html>
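The extraction step can be illustrated with a minimal sketch that uses only the Python standard library. The class name `AnchorExtractor` and the function `extract_links` are hypothetical names chosen for this example, and the record fields are passed in as plain strings; how the archive records were actually parsed is not stated in the slides.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collects (destination URL, anchor text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> element currently open, if any
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


def extract_links(source_url, archive_date, html):
    """Return (source URL, destination URL, anchor text, YYYYMM) tuples."""
    parser = AnchorExtractor()
    parser.feed(html)
    month = archive_date[:6]  # keep only YYYYMM from YYYYMMDD
    return [(source_url, dest, text, month) for dest, text in parser.links]


record_links = extract_links(
    "http://www.cwi.nl",
    "20091201",
    '<html><a href="http://www.nwo.nl"> NWO </a></html>',
)
```

Applied to the example record, this yields one link from http://www.cwi.nl to http://www.nwo.nl with anchor text NWO and archive month 200912.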
15. Link Processing
Filtering
Pages of type text/html
~70% of archived objects
Extraction
Source URL
Destination URL
Anchor text
Crawl-date (YYYYMM)
Cleaning
URL normalization; extract the host of the source and the destination
Remove spam, e.g., ‘rolex watches’
Partitioning
Based on one-year and one-month granularity
Deduplication
Remove duplicate links caused by crawling frequency
Duplicates share the same source, destination, and anchor text
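The cleaning, partitioning, and deduplication steps might be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the spam term list `SPAM_TERMS`, the choice to strip a leading `www.`, and the function names are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

SPAM_TERMS = {"rolex"}  # hypothetical spam vocabulary, for illustration only


def host_of(url):
    """Return the lowercased host of a URL; dropping 'www.' is an assumption."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def is_spam(anchor):
    """Flag anchor text containing any term from the illustrative spam list."""
    lowered = anchor.lower()
    return any(term in lowered for term in SPAM_TERMS)


def partition_links(links, granularity="year"):
    """Group (source, destination, anchor, YYYYMM) links by year or month.

    Each partition is a set, so links with the same source, destination,
    and anchor text are kept only once per partition.
    """
    width = 4 if granularity == "year" else 6
    partitions = defaultdict(set)
    for source, dest, anchor, yyyymm in links:
        if is_spam(anchor):
            continue
        partitions[yyyymm[:width]].add((source, dest, anchor))
    return partitions


links = [
    ("http://www.cwi.nl", "http://www.nwo.nl", "NWO", "200912"),
    ("http://www.cwi.nl", "http://www.nwo.nl", "NWO", "200912"),  # re-crawled duplicate
    ("http://www.cwi.nl", "http://www.nwo.nl", "NWO", "201001"),
    ("http://a.nl", "http://spam.example", "cheap rolex watches", "200912"),
]
by_year = partition_links(links, "year")
by_month = partition_links(links, "month")
```

Using a set per partition makes deduplication implicit: the re-crawled link in December 2009 is stored once, while the same link seen in January 2010 lands in a different partition.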
16. Hosts Evolution
Important hosts over time
Aggregate links by target host,
keeping unique source hosts
Multiple pages from same host linking to the same
target host are counted as one
Rank hosts based on number of source hosts
linking to them
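A minimal sketch of this ranking, assuming the links have already been reduced to (source host, target host) pairs; the function name `rank_target_hosts` and the alphabetical tie-break are choices made for this example.

```python
from collections import defaultdict


def rank_target_hosts(host_links):
    """Rank target hosts by the number of distinct source hosts linking to them."""
    sources = defaultdict(set)
    for source_host, target_host in host_links:
        # A set counts many pages from one source host only once.
        sources[target_host].add(source_host)
    counts = ((target, len(srcs)) for target, srcs in sources.items())
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))


host_links = [
    ("cwi.nl", "nwo.nl"),
    ("cwi.nl", "nwo.nl"),  # second page on cwi.nl linking to nwo.nl
    ("uva.nl", "nwo.nl"),
    ("cwi.nl", "kb.nl"),
]
ranking = rank_target_hosts(host_links)
```

Here nwo.nl ranks first with two distinct source hosts, even though three links point to it.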
17. % of new hosts over the years
Percentage of hosts in 2012 that do not appear in 2009, 2010, or 2011
18. Anchor Text Evolution
Measure the importance of anchor text a over
time in time-partitioned links
Aggregate by anchor text
Compute the archive-based popularity
Normalize by the maximum popularity
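The slides do not define the archive-based popularity precisely. The sketch below assumes it is the number of deduplicated links in a partition that carry a given anchor text, divided by the largest such count in that partition; `anchor_popularity` is a hypothetical name.

```python
from collections import Counter


def anchor_popularity(anchors):
    """Map each anchor text to its link count divided by the maximum count.

    `anchors` is one anchor text per deduplicated link in a time partition
    (an assumption about how popularity is counted).
    """
    counts = Counter(anchors)
    if not counts:
        return {}
    top = max(counts.values())
    return {anchor: count / top for anchor, count in counts.items()}


scores = anchor_popularity(["nwo", "nwo", "kb", "nwo", "kb", "cwi"])
```

With this normalization the most frequent anchor text in each partition scores 1.0, which makes scores comparable across partitions of different sizes.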
19. % new anchor text over years
Anchor text is new in a specific partition if it does
not appear in the previous partitions
Based on one-year granularity
59% new anchor text
Based on one-month granularity
34% new anchor text
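The definition of new anchor text can be sketched as a running set difference over chronologically ordered partitions. The function name `percent_new` and the decision to skip the first partition (which has no history to compare against) are assumptions of this example.

```python
def percent_new(partitions):
    """Return the percentage of new anchor texts per partition.

    `partitions` is a chronologically ordered list of
    (label, set of anchor texts) pairs.
    """
    seen = set()
    result = {}
    for label, anchors in partitions:
        if seen and anchors:
            new = anchors - seen  # anchors absent from all earlier partitions
            result[label] = 100.0 * len(new) / len(anchors)
        seen |= anchors
    return result


partitions = [
    ("2009", {"nwo", "kb"}),
    ("2010", {"nwo", "cwi"}),
    ("2011", {"kb", "cwi", "uva", "nwo"}),
]
new_share = percent_new(partitions)
```

The same function works for one-month granularity by passing monthly partitions instead of yearly ones.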
20. WikiStats
Views aggregation of Wikipedia (WP) pages
From Jan 2008 to Jan 2015
We focus on
Feb 2009 to Dec 2012
Similar to the period of our snapshot of the Dutch
Web archive
Keep WP titles viewed >= 1,000 times
21. Matching anchor text to WP titles
Pre-process WP titles like the anchor text
Lowercase
Stop-word removal
One-year and one-month granularity partitions
Collect titles by exact match with the anchors
Assume anchor popularity equals WP page
popularity
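A minimal sketch of the matching step, assuming titles arrive with underscores in place of spaces and with aggregated view counts. The stop-word list `STOP_WORDS` is a small illustrative sample, not the list used in the study, and the function names are hypothetical.

```python
STOP_WORDS = {"de", "het", "een", "the", "a", "an", "of"}  # illustrative only


def normalize(text):
    """Lowercase, treat underscores as spaces, and drop stop words."""
    tokens = text.lower().replace("_", " ").split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def match_anchors(anchors, title_views, min_views=1000):
    """Map each anchor text to the views of the WP title it matches exactly."""
    titles = {}
    for title, views in title_views.items():
        key = normalize(title)
        if key and views >= min_views:
            # If two titles normalize to the same key, keep the larger count.
            titles[key] = max(views, titles.get(key, 0))
    matches = {}
    for anchor in anchors:
        key = normalize(anchor)
        if key in titles:
            matches[anchor] = titles[key]
    return matches


title_views = {"Amsterdam": 50000, "De_Volkskrant": 12000, "Obscure_page": 10}
anchors = ["amsterdam", "de Volkskrant", "obscure page", "Rotterdam"]
matched = match_anchors(anchors, title_views)
```

Applying the same `normalize` function to both sides is what makes an exact string comparison meaningful; titles below the view threshold never enter the lookup table.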
22. Ranked anchor text with WP match
Different rank cut-offs
% overlap decreases as the cut-off increases
~56% of the top-1k anchor texts have a match
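The overlap at a given rank cut-off can be computed as the share of the top-k anchor texts that have a matching WP title. The sketch below uses a toy ranking; the function name `overlap_at_cutoffs` and the sample data are illustrative assumptions.

```python
def overlap_at_cutoffs(ranked_anchors, matched, cutoffs):
    """Return, per cut-off k, the percentage of the top-k anchors with a match."""
    result = {}
    for k in cutoffs:
        top = ranked_anchors[:k]
        if top:
            hits = sum(1 for anchor in top if anchor in matched)
            result[k] = 100.0 * hits / len(top)
    return result


ranked = ["amsterdam", "twitter", "klik hier", "utrecht"]  # most popular first
matched_set = {"amsterdam", "twitter", "utrecht"}
overlap = overlap_at_cutoffs(ranked, matched_set, [2, 4])
```

In this toy ranking the overlap falls from 100% at cut-off 2 to 75% at cut-off 4, the same downward trend the slide describes.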
23. Examples of popular anchor text (with match)
Major cities in the Netherlands
E.g., Amsterdam, Rotterdam, Groningen, and Utrecht
Social web sites
E.g., twitter, linkedin, flickr, and vimeo
Major Dutch daily newspapers
E.g., de Volkskrant, Telegraaf, and Trouw
Dutch public broadcasting
uitzending gemist
Government web service
E.g., belastingdienst
24. Discussion
Our original goal was to identify historically
trending events from the link evolution
recorded in the archive
Unfortunately, we found only a few examples
with our current analysis
E.g., ‘‘canon’’ *
However, important anchor text provides an
overview of important Dutch entities
* corresponding to an activity initiated by the government to define
the canonical historic events in Dutch history
25. Limitations & Future Work
Exact text matching between anchor text and
WP title
E.g., filmpje does not match WP title filmpje!
Additional pre-processing
Stemming, stopping, generalize from exact match to
match with low edit distance
Our analysis is based on a depth-first crawl of
a few thousand Dutch websites
Breadth-first crawl such as [CommonCrawl]
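The proposed generalization from exact match to low edit distance can be illustrated with a standard Levenshtein distance. This is a sketch of the future-work idea only; no distance threshold is given in the slides, so none is assumed here.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


distance = edit_distance("filmpje", "filmpje!")
```

The anchor text filmpje and the WP title filmpje! are one edit apart, so a matcher that tolerates a distance of 1 would pair them where exact matching does not.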
26. References
[Masanès06] J. Masanès. Web Archiving. Springer, 2006
[Jin et al.] Rong Jin, Alexander G. Hauptmann, and ChengXiang Zhai. Title language model for information retrieval. In SIGIR 2002
[Eiron & McCurley] Nadav Eiron and Kevin S. McCurley. Analysis of anchor text for web search. In SIGIR 2003
[CommonCrawl] https://commoncrawl.org/
[WikiStats] http://wikistats.ins.cwi.nl/