Temporal Anchor Text as Proxy for User Queries
Thaer Samar, Arjen P. de Vries

Web Archiving 1/2
 The Web is a major source of published
information
 Content on the Web evolves and changes
continuously
 Many initiatives aim to archive the Web
 Petabytes of archived data

Web Archiving 2/2
 Web archives are incomplete
 Impossible to include all Web pages due to
crawling limitations e.g., [Masanès06]
 Depth-first crawl, focus only on selected web sites
 Breadth-first crawl, focus on the entire domain,
but not in depth

Reconstruct Queries
 Our study: evolution of anchor text over time
to reconstruct what was important in the past
 Information that would be similar to user queries
 Inspiration:
 Document titles can be used as an approximation
of user queries [Jin et al.]
 Anchor text exhibits characteristics similar to user
queries and document titles [Eiron & McCurley]

Queries in the Past
 User queries have usually not been preserved
 Impossible to reconstruct which queries the
user would have used to search the archive
 However, web archives contain more than the
Web page content
 E.g., page source, different timestamps (archive
date, last-modified date), link structure

Link Evidence and Anchor Text
 Link information represents the source URL,
destination URL, and the anchor text
 Anchor text is a short text describing the
destination page
 Has been shown to improve search effectiveness in a
large number of Information Retrieval studies
Example link
 Source: http://www.cwi.nl
 Destination: http://www.nwo.nl
 Anchor text: ‘NWO’

Data: Dutch Web Archive
 National Library of the Netherlands (KB)
 Depth-first (selective) Web archive
 Since 2007
 10+ TB
 8,000+ websites
 Our snapshot
 2009-2012

Link Processing
Filtering
 Pages of type text/html
 ~70% of archived objects
Extraction
 Source URL
 Destination URL
 Anchor text
 Archive date (YYYYMM)
Cleaning
 URL normalization; get the host of the source and the destination
 Clean spam, e.g., rolex watches
Partitioning
 Based on one-year and one-month granularity
Deduplication
 Remove duplicate links caused by crawling frequency
 Same source, destination, and anchor text

Web Archive Record (example)
URL: http://www.cwi.nl
Archive-Date: 20091201
Content-Type: text/html
<html>
<a href=http://www.nwo.nl> NWO </a>
</html>
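
The processing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the record layout (a dict with `url`, `archive_date`, `content_type`, `html`), the helper names, and the choice to drop a leading `www.` during normalization are all assumptions. It also keeps only hosts rather than full URLs, and it omits spam cleaning.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class AnchorExtractor(HTMLParser):
    """Collect (destination URL, anchor text) pairs from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the source URL.
                self._href = urljoin(self.base_url, href.strip())
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


def host_of(url):
    """Reduce a URL to its lowercased host, dropping a leading 'www.' (assumption)."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def extract_links(record):
    """Turn one archive record (hypothetical dict layout) into link tuples."""
    if record["content_type"] != "text/html":      # filtering step
        return []
    parser = AnchorExtractor(record["url"])
    parser.feed(record["html"])
    parser.close()
    month = record["archive_date"][:6]              # YYYYMMDD -> YYYYMM
    return [
        (host_of(record["url"]), host_of(dest), anchor.lower(), month)
        for dest, anchor in parser.links
        if anchor
    ]


record = {
    "url": "http://www.cwi.nl",
    "archive_date": "20091201",
    "content_type": "text/html",
    "html": "<html><a href=http://www.nwo.nl> NWO </a></html>",
}
# A set removes links repeated across crawls within the same month.
links = set(extract_links(record))
```

For the example record this yields a single tuple with source host `cwi.nl`, destination host `nwo.nl`, anchor text `nwo`, and partition `200912`. One-year partitions would use the first four characters of the date instead of six.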

Hosts Evolution
 Important hosts over time
 Aggregate links based on the target host
 Keep unique source hosts
 Multiple pages from same host linking to the same
target host are counted as one
 Rank hosts based on number of source hosts
linking to them
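
A rough sketch of this host ranking, assuming the links of one time partition are already available as (source host, target host) pairs. The function name and the sample data are illustrative only.

```python
from collections import defaultdict


def rank_target_hosts(links):
    """Rank target hosts by the number of distinct source hosts linking to them."""
    sources = defaultdict(set)
    for src_host, dst_host in links:
        # A set counts many pages from one host as a single source host.
        sources[dst_host].add(src_host)
    return sorted(
        ((dst, len(srcs)) for dst, srcs in sources.items()),
        key=lambda pair: (-pair[1], pair[0]),   # most linked first, ties by name
    )


links_2009 = [
    ("cwi.nl", "nwo.nl"),
    ("cwi.nl", "nwo.nl"),   # second page on the same host: counted once
    ("kb.nl", "nwo.nl"),
    ("kb.nl", "cwi.nl"),
]
ranking = rank_target_hosts(links_2009)
```

Running the same function on each yearly or monthly partition gives one ranking per period, which can then be compared over time.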

% of new hosts over the years
 Example: % of new hosts in 2012 = hosts not seen in {2009, 2010, 2011}

Anchor Text Evolution
 Measure the importance of anchor text a over
time in time-partitioned links
 Aggregate by anchor text
 Compute the archive-based popularity
 Normalize by the maximum

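The slides do not spell out how the archive-based popularity is computed. The sketch below assumes, purely for illustration, that the popularity of anchor text a in a partition is the number of distinct source hosts using it, divided by the maximum such count in that partition.

```python
from collections import defaultdict


def anchor_popularity(links):
    """Max-normalized popularity per anchor text for one time partition.

    links: iterable of (source host, anchor text) pairs.
    Popularity = distinct source hosts per anchor / maximum over all anchors
    (an assumed definition, not taken from the slides).
    """
    hosts = defaultdict(set)
    for src_host, anchor in links:
        hosts[anchor].add(src_host)
    counts = {anchor: len(srcs) for anchor, srcs in hosts.items()}
    top = max(counts.values(), default=0)
    return {anchor: n / top for anchor, n in counts.items()} if top else {}


partition = [
    ("cwi.nl", "nwo"),
    ("kb.nl", "nwo"),
    ("kb.nl", "twitter"),
]
popularity = anchor_popularity(partition)
```

With this normalization the most popular anchor text in every partition scores 1.0, which makes scores comparable across partitions of different size.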
% new anchor text over years
 Anchor text is new in a specific partition if it does
not appear in the previous partitions
 Based on one-year granularity
 59% new anchor text
 Based on one-month granularity
 34% new anchor text
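
The "new in a partition" definition can be sketched as follows. The function name and the toy anchor sets are assumptions; the same function would work for hosts instead of anchor texts.

```python
def percent_new(partitions):
    """Percentage of items in each partition not seen in any earlier partition.

    partitions: list of (label, set of items) in chronological order.
    The first partition is skipped because it has no history to compare with.
    """
    seen, result = set(), {}
    for index, (label, items) in enumerate(partitions):
        if index > 0 and items:
            result[label] = 100.0 * len(items - seen) / len(items)
        seen |= items
    return result


# Toy data, not taken from the archive.
yearly = [
    ("2009", {"nwo", "twitter"}),
    ("2010", {"nwo", "linkedin"}),
    ("2011", {"nwo", "twitter", "linkedin", "flickr"}),
]
new_per_year = percent_new(yearly)
```

Passing monthly partitions instead of yearly ones gives the one-month granularity variant.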

WikiStats
 Aggregated page-view counts of Wikipedia (WP) pages [WikiStats]
 From Jan 2008 to Jan 2015
 We focus on
 Feb 2009 to Dec 2012
 Similar to the period of our snapshot of the Dutch
Web archive
 Keep WP titles viewed >= 1,000 times

Matching anchor text to WP titles
 Pre-process WP titles like the anchor text
 Lowercase
 Stop-word removal
 One-year and one-month granularity partitions
 Collect titles by exact match with the anchors
 Assume anchor popularity equals WP page
popularity

Ranked anchor text with WP match
 Different rank cut-offs
 % overlap decreases as the cut-off increases
 ~56% of the top-1k anchor texts have a match
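
The overlap at different rank cut-offs could be computed along these lines. The function name and the toy ranking are assumptions; the numbers below are not results from the study.

```python
def overlap_at_cutoffs(ranked_anchors, matched, cutoffs):
    """Percentage of the top-k anchors that have a WP title match, per cut-off k."""
    result = {}
    for k in cutoffs:
        top = ranked_anchors[:k]
        if top:
            result[k] = 100.0 * sum(anchor in matched for anchor in top) / len(top)
    return result


# Toy ranking, most popular anchor text first.
ranked = ["amsterdam", "twitter", "home", "contact", "utrecht"]
with_match = {"amsterdam", "twitter", "utrecht"}
overlap = overlap_at_cutoffs(ranked, with_match, cutoffs=(2, 4, 5))
```

Navigational anchors such as `home` or `contact` are the kind of text that would lower the overlap as the cut-off grows.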

Examples of popular anchor text (with match)
 Major cities in the Netherlands
 E.g., Amsterdam, Rotterdam, Groningen, and Utrecht
 Social web sites
 E.g., twitter, linkedin, flickr, and vimeo
 Major Dutch daily newspapers
 E.g., de Volkskrant, Telegraaf, and Trouw
 Dutch public broadcasting
 uitzending gemist
 Government web service
 E.g., belastingdienst

Discussion
 Our original goal was to identify historically
trending events from the link evolution
recorded in the archive
 Unfortunately, we found only a few examples
with our current analysis
 E.g., ‘‘canon’’ *
 However, important anchor text provides an
overview of important Dutch entities
* corresponding to an activity initiated by the government to define
the canonical historic events in Dutch history

Limitations & Future Work
 Exact text matching between anchor text and
WP title
 E.g., filmpje does not match WP title filmpje!
 Additional pre-processing
 Stemming, stopping, generalize from exact match to
match with low edit distance
 Our analysis is based on a depth-first crawl of
a few thousand Dutch websites
 Breadth-first crawl such as [CommonCrawl]
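
The proposed generalization from exact match to a match with low edit distance could look like the sketch below. It uses plain Levenshtein distance with a threshold of 1, both of which are assumptions; the slides do not fix a distance measure or a threshold.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,          # deletion
                curr[j - 1] + 1,      # insertion
                prev[j - 1] + cost,   # substitution or match
            ))
        prev = curr
    return prev[-1]


def fuzzy_match(anchor, titles, max_distance=1):
    """Return the titles within max_distance edits of the anchor text."""
    return [t for t in titles if edit_distance(anchor, t) <= max_distance]


candidates = fuzzy_match("filmpje", ["filmpje!", "film", "amsterdam"])
```

With this relaxation the anchor text `filmpje` matches the title `filmpje!`, which exact matching misses. Comparing every anchor against every title is quadratic, so a real run over the archive would likely need blocking or an index.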

References
 [Masanès06] J. Masanès. Web Archiving. Springer, 2006
 [Jin et al.] Rong Jin, Alexander G. Hauptmann, and ChengXiang Zhai.
Title language model for information retrieval. In SIGIR 2002
 [Eiron & McCurley] Nadav Eiron and Kevin S. McCurley. Analysis of
anchor text for web search. In SIGIR 2003
 [CommonCrawl] https://commoncrawl.org/
 [WikiStats] http://wikistats.ins.cwi.nl/
