Barbara Poblete, Aristides Gionis, Carlos Castillo: \"Dr. Searcher and Mr. Browser: a unified hyperlink-click graph\". Proceedings of CIKM, Napa Valley, CA, USA, October 2008.
Empowering Africa's Next Generation: The AI Leadership Blueprint
Dr. Searcher and Mr. Browser: A unified hyperlink-click graph
1. Dr. Searcher and Mr. Browser:
A unified hyperlink-click graph
Barbara Poblete, Carlos Castillo, Aristides Gionis
University Pompeu Fabra, Yahoo! Research
Barcelona, Spain
CIKM 2008
2. In this talk
New unified web graph: hyperlink-click graph
Union of the hyperlink graph and the click graph
Random walk generates robust results that match or are
better than the results obtained on either individual graph
3. Motivation
Two types of web graphs: click and hyperlink graph
Represent two most common tasks of users:
searching and browsing
Correspond to two prototypical ways of looking for
information: asking and exploring
Edges capture semantic relations among nodes:
e.g., similarity and authority endorsement
4. Motivation...
Drawbacks:
§ Hyperlink graph: adversarial increase in PageRank scores,
i.e., spam or link farms
§ Click graph: sparsity, inherent bias in web-search engine
ranking, dependency on textual match and click spam
Our claim: Hyperlink and click graphs are complementary and
can be used to alleviate the shortcomings of each other
5. Motivation...
Drawbacks:
§ Hyperlink graph: adversarial increase in PageRank scores,
i.e., spam or link farms
§ Click graph: sparsity, inherent bias in web-search engine
ranking, dependency on textual match and click spam
Our claim: Hyperlink and click graphs are complementary and
can be used to alleviate the shortcomings of each other
6. Contribution
Introduce the hyperlink-click graph,
union of hyperlink and click graph
Consider random walk on the hyperlink-click graph
1 Model user behavior more accurately
2 Show that ranking with the hyperlink-click graph is similar
to the best performance by either of the two graphs
3 The unified graph compensates where either graph
performs poorly
4 more robust and fail-safe
7. Applications
Uses of the hyperlink-click graph:
Ranking of documents
Query ranking and query recommendation
Similarity search
Spam detection
9. Random walks on web graphs
Random walk on the hyperlink graph
PH = αNH + (1 − α)1H
Random walk on the click graph
PC = αNC + (1 − α)1C
Random walk on the hyperlink-click graph
PHC = αβNC + α(1 − β)NH + (1 − α)1
10. Random walks on web graphs
Random walk on the hyperlink graph
PH = αNH + (1 − α)1H
Random walk on the click graph
PC = αNC + (1 − α)1C
Random walk on the hyperlink-click graph
PHC = αβNC + α(1 − β)NH + (1 − α)1
11. Random walks on web graphs
Random walk on the hyperlink graph
PH = αNH + (1 − α)1H
Random walk on the click graph
PC = αNC + (1 − α)1C
Random walk on the hyperlink-click graph
PHC = αβNC + α(1 − β)NH + (1 − α)1
13. Evaluation and results
Validate utility of the random walk scores on the
hyperlink-click graph
Compare scores with those produced from the hyperlink
and the click graph
14. Evaluation and results...
We focus on two tasks in which a good ranking method
should perform well:
ranking high-quality documents and
ranking pairs of documents
The evaluation is centered on analyzing the dissimilarities
among the different models
15. Evaluation and results...
Dataset:
Yahoo! query log
9 K seed documents extracted from the query log
- documents with at least 10 clicks
61 K queries with at least one click to the seed documents
Using a web crawl expanded to 144 million documents
The expansion considers all in- and out-neighbors of all
documents in the seed set seed set.
16. Evaluation and results...
Two types of data:
Without sponsored results: query log only with
algorithmic results
With sponsored results:
17. Evaluation and results...
Random walk evaluation
we compute scores on the complete datasets, but
we evaluate only on documents in the intersection of the
three graphs
Two tasks:
1 ranking high-quality documents (dmoz),
2 ranking pairs of documents (user evaluation).
18. Evaluation and results...
dmoz:
ΠZ : Our first measure is the normalized sum of the π scores
of DZ documents.
d∈DZπ(d)
ΠZ =
d∈DC π(d)
ΓZ : Goodman-Kruskal Gamma Γ measure between the
rankings
D −A
Γ=
D +A
19. Evaluation and results...
User study:
13 users
1 710 accessments
users expressed not neutral opinion in 32% of document
pairs
Use Γ to measure pairs of documents on which rankings
agree with users
20. Evaluation and results...
dmoz:
Macro-evaluation: captures the overall scores of
high-quality documents
ΠZ and ΓZ are computed considering all the documents in
the evaluation set
Micro-evaluation: at query level
ΠZ and ΓZ in the evaluation set are reduced to only those
documents clicked from a particular query.
(User study: micro-evaluation)
21. Evaluation and results...
dmoz:
Macro-evaluation: captures the overall scores of
high-quality documents
ΠZ and ΓZ are computed considering all the documents in
the evaluation set
Micro-evaluation: at query level
ΠZ and ΓZ in the evaluation set are reduced to only those
documents clicked from a particular query.
(User study: micro-evaluation)
25. Evaluation and results...
Summary of ΓZ values for dmoz:
1.000
click graph
hyperlink-click graph
hyperlink graph
0.800
0.600
0.400
value
0.200
0.000
-0.200
Macro (with variations) Macro (no variations) Micro (with variations) Micro (no variations)
26. Evaluation and results...
Summary of ΠZ values for dmoz:
0.900
click graph
hyperlink-click graph
0.800 hyperlink graph
0.700
0.600
value
0.500
0.400
0.300
0.200
Macro (with variations) Macro (no variations) Micro (with variations) Micro (no variations)
27. Conclusions
We study random walk on a unified web graph
Intended to model user searching and browsing behavior
Several combinations of metrics are used for evaluation
Experimental evaluation shows:
The unified graph is always close to the best performance
of either the click or the hyperlink graph
28. Future work
Analyze how to deal with the inherent bias that exists in
any ranking technique based on usage mining
Study other web mining applications
1 Link and click spam detection
2 Similarity search
3 Query recommendation