1. Link Analysis for Web Information Retrieval
With Applications to Adversarial IR
Carlos Castillo (1)
chato@yahoo-inc.com
With: R. Baeza-Yates (1,3), L. Becchetti (2), P. Boldi (5), D. Donato (1), A. Gionis (1), S. Leonardi (2), V. Murdock (1), M. Santini (5), F. Silvestri (4), S. Vigna (5)
(1) Yahoo! Research Barcelona – Catalunya, Spain
(2) Università di Roma “La Sapienza” – Rome, Italy
(3) Yahoo! Research Santiago – Chile
(4) ISTI-CNR – Pisa, Italy
(5) Università degli Studi di Milano – Milan, Italy
2. When you have a hammer
3. Everything looks like a graph!
4. Outline
1 Hypothesis
2 Levels of link analysis
3 Ranking
4 Web spam
5 ... detection
6 ... links
7 ... contents
8 ... both
9 Summary
5. Links are not placed at random
Topical locality hypothesis
Link endorsement hypothesis
7. Topical locality hypothesis
“We found that pages are significantly more likely to be related topically to pages to which they are linked, as opposed to other pages selected at random or other nearby pages.” [Davison, 2000]
8. [Figure: average text similarity (y-axis, 0.2 to 0.7) versus link distance (x-axis, 1 to 5)]
[Baeza-Yates et al., 2006], data from UK 2006
9. Link similarity cases
Link (geodesic) distance
Co-citation
Bibliographic coupling
10. Co-citation
11. Bibliographic coupling
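The two similarity cases named above can be sketched in a few lines. This is a minimal illustration, assuming the usual definitions (co-citation counts the pages that link to both pages, bibliographic coupling counts the pages that both pages link to); the function name `link_similarity` and the toy edge list are hypothetical and not taken from the slides.

```python
from collections import defaultdict

def link_similarity(edges):
    """Build co-citation and bibliographic-coupling counters from directed edges."""
    in_links = defaultdict(set)   # page -> pages that link to it
    out_links = defaultdict(set)  # page -> pages it links to
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)

    def cocitation(a, b):
        # Number of pages that link to both a and b.
        return len(in_links[a] & in_links[b])

    def coupling(a, b):
        # Number of pages that both a and b link to.
        return len(out_links[a] & out_links[b])

    return cocitation, coupling

# Hypothetical toy graph: p and q both cite a and b; a and b both cite x.
edges = [("p", "a"), ("p", "b"), ("q", "a"), ("q", "b"),
         ("a", "x"), ("b", "x"), ("b", "y")]
cocitation, coupling = link_similarity(edges)
print(cocitation("a", "b"))  # 2
print(coupling("a", "b"))    # 1
```

Raw counts are used here for simplicity; a normalized variant (for example, dividing by the size of the union of the two link sets) would be an equally reasonable choice.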
12. (Both can be generalized)
Both co-citation and bibliographic coupling can be generalized. For example, SimRank [Jeh and Widom, 2002] generalizes the idea of co-citation to several levels.
13. Link endorsement hypothesis
Links are assumed to be endorsements (votes, positive opinions) [Li, 1998]
But they can represent:
Disagreement
Self citations
Nepotism
Citations to methodological documents
etc.
15. Furthermore
They measure quantity, not quality (e.g. “Stop the numbers game!” in Communications of the ACM a few months ago)
Self-citations are frequent
In some topics there is more linking
Citations go from newer to older
New documents get few citations [Baeza-Yates et al., 2002]
Many of the citations are irrelevant
21. Nevertheless
Both the topical locality hypothesis and the link endorsement hypothesis are meaningful on the Web
Analogy with economics
Think of the hypotheses requiring many buyers/sellers, zero transaction costs, perfect information, etc. in economics
23. Levels of link analysis
25. How to find meaningful patterns?
Several levels of analysis:
Macroscopic view: overall structure
Microscopic view: nodes
Mesoscopic view: regions
28. Macroscopic view, e.g. Bow-tie
[Broder et al., 2000]
30. Macroscopic view, e.g. Jellyfish
[Tauro et al., 2001] - Internet Autonomous Systems (AS) Topology
32. Microscopic view, e.g. Degree
[Barabási, 2002] and others
33. “While entirely of human design, the emerging network appears to have more in common with a cell or an ecological system than with a Swiss watch.” [Barabási, 2002]
34. Other scale-free networks
Power grid designs
Sexual partners in humans
Collaboration of movie actors in films
Citations in scientific publications
Protein interactions
35. Microscopic view, e.g. Degree
[Figure panels: Greece, Chile, Spain, Korea]
[Baeza-Yates et al., 2007] compares this distribution in 8 countries ... guess what the result is?
36. Mesoscopic view, e.g. Hop-plot
38. Mesoscopic view, e.g. Hop-plot
[Figure: four panels plotting frequency (0.0 to 0.3) against distance (5 to 30): .it (40M pages), .uk (18M pages), .eu.int (800K pages), synthetic graph (100K pages)]
[Baeza-Yates et al., 2006]
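One way to read the hop-plot panels is as the fraction of connected page pairs found at each link distance. The sketch below computes that with a breadth-first search from every node, assuming the graph fits in memory as an adjacency dictionary; the name `hop_plot` and the toy graph are hypothetical. Exhaustive search like this only suits small graphs, and how the figures for the collections above were actually computed is not stated in the slides.

```python
from collections import Counter, deque

def hop_plot(adj):
    """Fraction of reachable ordered pairs (u, v), u != v, at each distance."""
    counts = Counter()
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    counts[dist[v]] += 1
                    queue.append(v)
    total = sum(counts.values())
    if total == 0:
        return {}  # no pair of distinct nodes is connected
    return {d: c / total for d, c in sorted(counts.items())}

# Hypothetical toy graph given as page -> list of out-links.
adj = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
freq = hop_plot(adj)
for distance, fraction in freq.items():
    print(distance, round(fraction, 3))
```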
40. Models
Preferential attachment
Copy model
Hybrid models
43. Preferential attachment
“A common property of many large networks is that the vertex connectivities follow a scale-free power-law distribution. This feature was found to be a consequence of two generic mechanisms: (i) networks expand continuously by the addition of new vertices, and (ii) new vertices attach preferentially to sites that are already well connected.” [Barabási and Albert, 1999]
“rich get richer”
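The two mechanisms in the quote (growth, plus attachment proportional to current connectivity) can be simulated directly. This is a rough sketch rather than the exact model of the cited paper: the seed clique, the parameter `m` (links added per new vertex), and the function name are assumptions made for illustration.

```python
import random
from collections import Counter

def preferential_attachment(n, m=2, seed=0):
    """Grow an undirected graph of n vertices (n > m); each new vertex links
    to m distinct existing vertices picked proportionally to their degree."""
    rng = random.Random(seed)
    edges = []
    # Assumed starting point: a small clique, so every vertex has degree > 0.
    for i in range(m + 1):
        for j in range(i):
            edges.append((i, j))
    # Each vertex appears once per incident edge, so a uniform draw from
    # this list is a degree-proportional draw ("rich get richer").
    endpoints = [v for e in edges for v in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges

edges = preferential_attachment(200, m=2, seed=1)
degree = Counter(v for e in edges for v in e)
print(degree.most_common(5))  # a few early vertices tend to collect many links
```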
45. Ranking
46. Counting in-links does not work
“With a simple program, huge numbers of pages can be created easily, artificially inflating citation counts. Because the Web environment contains profit seeking ventures, attention getting strategies evolve in response to search engine algorithms. For this reason, any evaluation strategy which counts replicable features of web pages is prone to manipulation” [Page et al., 1998]
47. PageRank: simplified version
$$\mathrm{PageRank}'(u) = \sum_{v \in \Gamma^-(u)} \frac{\mathrm{PageRank}'(v)}{|\Gamma^+(v)|}$$
$\Gamma^-(\cdot)$: in-links
$\Gamma^+(\cdot)$: out-links
48. Iterations with pseudo-PageRank
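The iteration with the simplified score can be sketched as repeated application of the update, where each page splits its current score evenly among its out-links. The sketch assumes a uniform starting score and a graph in which every page has at least one out-link; the function name and the three-page graph are hypothetical.

```python
def simplified_pagerank(out_links, iterations=50):
    """Iterate PageRank'(u) = sum over in-links v of PageRank'(v) / outdeg(v)."""
    nodes = list(out_links)
    score = {u: 1.0 / len(nodes) for u in nodes}  # assumed uniform start
    for _ in range(iterations):
        new = {u: 0.0 for u in nodes}
        for v, targets in out_links.items():
            for u in targets:
                # v passes an equal share of its score along each out-link.
                new[u] += score[v] / len(targets)
        score = new
    return score

# Hypothetical graph in which every page has at least one out-link.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = simplified_pagerank(graph)
print({u: round(s, 3) for u, s in scores.items()})
# roughly {'a': 0.4, 'b': 0.2, 'c': 0.4}
```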
50. So far, so good, but ...
The Web includes many pages with no out-links; these will accumulate all of the score
We would like every Web page to accumulate some ranking
We add random jumps (teleportation)
51. PageRank
$$\mathrm{PageRank}(u) = \frac{\epsilon}{N} + (1 - \epsilon) \sum_{v \in \Gamma^-(u)} \frac{\mathrm{PageRank}(v)}{|\Gamma^+(v)|}$$
$\Gamma^-(\cdot)$: in-links
$\Gamma^+(\cdot)$: out-links
$\epsilon/N$: jump to a random page with probability $\epsilon \approx 0.15$
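A power-iteration sketch of the formula with random jumps follows. The slide does not say how pages without out-links are treated; spreading their score uniformly over all pages is a common convention and is an assumption here, as are the function name and the example graph.

```python
def pagerank(out_links, eps=0.15, iterations=100):
    """PageRank(u) = eps/N + (1 - eps) * sum over in-links v of PageRank(v)/outdeg(v).

    Every link target must also appear as a key of out_links.
    """
    nodes = list(out_links)
    n = len(nodes)
    score = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Assumption: score held by pages without out-links is spread uniformly.
        dangling = sum(score[v] for v in nodes if not out_links[v])
        new = {u: eps / n + (1 - eps) * dangling / n for u in nodes}
        for v, targets in out_links.items():
            for u in targets:
                new[u] += (1 - eps) * score[v] / len(targets)
        score = new
    return score

# Hypothetical graph; "e" has no out-links and "d" has no in-links.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"], "e": []}
ranks = pagerank(graph)
for page, value in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(value, 3))
```

With the jump term, pages that nobody links to ("d" and "e" here) still receive a small baseline score, and the scores continue to sum to one.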
52. HITS
Two scores per page: “hub score” and “authority score”.
55. Iterations
Initialize:
$$\mathrm{hub}(u, 0) = \mathrm{auth}(u, 0) = \frac{1}{N}$$
(a uniform positive value; starting every score at 0 would keep every score at 0)
Iterate:
$$\mathrm{hub}(u, t) = \sum_{v \in \Gamma^+(u)} \frac{\mathrm{auth}(v, t-1)}{|\Gamma^-(v)|}$$
$$\mathrm{auth}(u, t) = \sum_{v \in \Gamma^-(u)} \frac{\mathrm{hub}(v, t-1)}{|\Gamma^+(v)|}$$
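The hub and authority updates can be sketched as below, following the degree-normalized form written on the slide. Two assumptions are made: scores start at a uniform positive value of 1/N (a start of zero would never change), and both updates read the scores of the previous step. Kleinberg's original HITS formulation instead sums unnormalized scores and rescales the score vectors after each step; the function name and toy graph here are hypothetical.

```python
def hits_normalized(out_links, iterations=20):
    """Degree-normalized hub/authority iteration, as written on the slide."""
    nodes = list(out_links)
    in_links = {u: [] for u in nodes}
    for v, targets in out_links.items():
        for u in targets:
            in_links[u].append(v)
    n = len(nodes)
    hub = {u: 1.0 / n for u in nodes}   # assumed uniform start
    auth = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # hub(u, t): authority of the pages u links to, each shared
        # among the pages that link to it.
        new_hub = {u: sum(auth[v] / len(in_links[v]) for v in out_links[u])
                   for u in nodes}
        # auth(u, t): hub score of the pages linking to u, each shared
        # among the pages it links to.
        new_auth = {u: sum(hub[v] / len(out_links[v]) for v in in_links[u])
                    for u in nodes}
        hub, auth = new_hub, new_auth
    return hub, auth

# Hypothetical graph: h1 and h2 act as hubs, a1 and a2 as authorities.
graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits_normalized(graph)
print({u: round(s, 3) for u, s in hub.items()})
print({u: round(s, 3) for u, s in auth.items()})
```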
56. Web spam
57. What is on the Web?
58. What is on the Web [2.0]?
59. What else is on the Web?
“The sum of all human knowledge plus porn” – Robert Gilbert
60. What’s happening on the Web?
There is a fierce competition for your attention
61. What’s happening on the Web?
Search engines are to some extent arbiters of this competition, and they must watch it closely, otherwise ...
62. Some cheating occurs
1986 FIFA World Cup, Argentina vs England
63. Simple web spam
64. Hidden text
65. Made for advertising
66. Search engine?
67. Fake search engine
68. “Normal” content in link farms
70. Cloaking
71. Redirection
72. Redirects using Javascript
Simple redirect
<script>
document.location="http://www.topsearch10.com/";
</script>
“Hidden” redirect
<script>
var1=24; var2=var1;
if(var1==var2) {
document.location="http://www.topsearch10.com/";
}
</script>
73. Problem: obfuscated code
Obfuscated redirect
<script>
var a1="win",a2="dow",a3="loca",a4="tion.",
a5="replace",a6="('http://www.top10search.com/')";
var i,str="";
for(i=1;i<=6;i++)
{
str += eval("a"+i);
}
eval(str);
</script>
74. Problem: really obfuscated code
Encoded javascript
<script>
var s = "%5CBE0D%5C%05GDHJ BDE%16...%04%0E";
var e = '', i;
eval(unescape('s%eDunescape%28s%29%3Bfor...%3B'));
</script>
More examples: [Chellapilla and Maykov, 2007]
75. There are many attempts at cheating on the Web
Most of these are spam:
1,630,000 results for “free mp3 hilton viagra” in SE1
1,760,000 results for “credit vicodin loan” in SE2
1,320,000 results for “porn mortgage” in SE3
76. Costs
Costs:
X Costs for users: lower precision for some queries
X Costs for search engines: wasted storage space, network resources, and processing cycles
X Costs for the publishers: resources invested in cheating and not in improving their contents
77. Adversarial IR Issues on the Web
Link spam
Content spam
Cloaking
Comment/forum/wiki spam
Spam-oriented blogging
Click fraud ×2
Reverse engineering of ranking algorithms
Web content filtering
Advertisement blocking
Stealth crawling
Malicious tagging
... more?
89. Opportunities for Web spam
X Spamdexing
Keyword stuffing
Link farms
Spam blogs (splogs)
Cloaking
Adversarial relationship
Every undeserved gain in ranking for a spammer is a loss of precision for the search engine.
91. Web spam detection
92. Motivation
[Fetterly et al., 2004] hypothesized that studying the distribution of statistics about pages could be a good way of detecting spam pages:
“in a number of these distributions, outlier values are associated with web spam”
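The idea of looking for outlier values in a distribution of page statistics can be illustrated with a small sketch. The statistic (out-degree per host), the data, the threshold, and the use of the median absolute deviation are all assumptions chosen for illustration; the slides do not specify which statistics or outlier tests were used.

```python
from statistics import median

def outliers(values, threshold=5.0):
    """Keys whose value lies more than threshold * MAD away from the median."""
    med = median(values.values())
    mad = median(abs(v - med) for v in values.values())
    if mad == 0:
        return []  # no spread to measure deviations against
    return sorted(k for k, v in values.items()
                  if abs(v - med) / mad > threshold)

# Hypothetical per-host statistic, e.g. average number of out-links per page.
out_degree = {"host-a": 12, "host-b": 9, "host-c": 15,
              "host-d": 11, "host-e": 950, "host-f": 14}
suspects = outliers(out_degree)
print(suspects)  # ['host-e']
```

An outlier is only a candidate for inspection: the quoted observation is that outliers are associated with spam, not that every outlier is spam.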
93. Machine Learning
94. Training of a Decision Tree
95. Decision Tree (error = 15%)
96. Decision Tree (error = 15% → 12%)
97. Machine Learning (cont.)
98. Feature Extraction
99. Challenges: Machine Learning
Machine Learning Challenges:
Instances are not really independent (graph)
Learning with few examples
Scalability
102. Challenges: Information Retrieval
Information Retrieval Challenges:
Feature extraction: which features?
Feature aggregation: page/host/domain
Feature propagation (graph)
Recall/precision tradeoffs
Scalability
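Feature aggregation from pages to hosts, one of the challenges listed above, might look like the sketch below. Averaging is only one possible aggregate (maximum, sum, or variance are others), and the feature names, URLs, and function name are hypothetical. The sketch also assumes every page carries the same set of numeric features.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def aggregate_by_host(page_features):
    """Average each numeric page-level feature over the pages of a host."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for url, features in page_features.items():
        host = urlsplit(url).hostname
        counts[host] += 1
        for name, value in features.items():
            sums[host][name] += value
    return {host: {name: total / counts[host] for name, total in feats.items()}
            for host, feats in sums.items()}

# Hypothetical page-level features keyed by URL.
pages = {
    "http://www.example.com/a": {"out_links": 10, "words": 300},
    "http://www.example.com/b": {"out_links": 20, "words": 500},
    "http://www.example.org/": {"out_links": 4, "words": 120},
}
hosts = aggregate_by_host(pages)
print(hosts["www.example.com"])  # {'out_links': 15.0, 'words': 400.0}
```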
107. Challenges: Data
Data is difficult to collect
Data is expensive to label
Labels are sparse
Humans do not always agree
108. Agreement
109. Results
Labels
Label              Frequency   Percentage
Normal             4,046       61.75%
Borderline         709         10.82%
Spam               1,447       22.08%
Cannot classify    350         5.34%

Agreement
Category     Kappa   Interpretation
normal       0.62    Substantial agreement
spam         0.63    Substantial agreement
borderline   0.11    Slight agreement
global       0.56    Moderate agreement

Reference collection [Castillo et al., 2006]
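The slides do not say which kappa statistic was used for the agreement table. As a hedged illustration only, the sketch below computes Cohen's kappa for two assessors and maps a value to the commonly used Landis and Koch scale, which the interpretations above appear to follow. The function names `cohen_kappa` and `interpret` are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two assessors who labeled the same items."""
    n = len(labels_a)
    # Fraction of items on which both assessors chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each assessor's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Undefined (division by zero) when chance agreement is exactly 1.
    return (observed - expected) / (1 - expected)

def interpret(kappa):
    """Map a kappa value to the Landis and Koch verbal scale."""
    if kappa < 0:
        return "Poor agreement"
    scale = [(0.20, "Slight"), (0.40, "Fair"),
             (0.60, "Moderate"), (0.80, "Substantial")]
    for upper, name in scale:
        if kappa <= upper:
            return f"{name} agreement"
    return "Almost perfect agreement"
```

With more than two assessors per host, a multi-rater statistic would be needed instead; this two-rater version is only meant to show where the chance correction enters.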
111. Topological spam: link farms
Single-level farms can be detected by searching for groups of nodes sharing their out-links [Gibson et al., 2005]
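To make the idea of "nodes sharing their out-links" concrete, here is a deliberately simplified sketch that groups nodes whose out-link sets are exactly identical. This exact-match grouping is an assumption made for illustration and is not the method of the cited paper; the function name and the `min_size` and `min_links` thresholds are hypothetical.

```python
from collections import defaultdict

def groups_sharing_out_links(out_links, min_size=3, min_links=2):
    """out_links: dict node -> iterable of link targets.

    Returns groups of at least min_size nodes whose out-link sets are identical
    and contain at least min_links targets.
    """
    by_targets = defaultdict(list)
    for node, targets in out_links.items():
        target_set = frozenset(targets)
        if len(target_set) >= min_links:
            by_targets[target_set].append(node)
    return [sorted(group) for group in by_targets.values() if len(group) >= min_size]
```

A practical detector would need to tolerate near-identical link sets and scale to very large graphs, which this sketch does not attempt.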
113. Handling large graphs
For large graphs, random access is not possible.
Large graphs do not fit in main memory
Streaming model of computation
116. Semi-streaming model
Memory is large enough to hold some data per node
Disk is large enough to hold some data per edge
A small number of passes over the data
117. Restriction
Semi-streaming model: graph on disk
for node : 1 . . . N do
    INITIALIZE-MEM(node)
end for
for distance : 1 . . . d do {Iteration step}
    for src : 1 . . . N do {Follow links in the graph}
        for all links from src to dest do
            COMPUTE(src, dest)
        end for
    end for
    NORMALIZE
end for
POST-PROCESS
return Something
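As a rough illustration of this template, the sketch below computes PageRank-style scores while keeping only one number per node in memory and reading the edge list sequentially from disk once per iteration. The edge-file format (one `src dest` pair per line), the function names, and the damping value 0.85 are assumptions made for the example, not details from the slides.

```python
import os
import tempfile

def semi_streaming_pagerank(edge_path, n, passes=20, alpha=0.85):
    """PageRank-like scores with O(n) memory and sequential scans of the edges."""
    # Preliminary pass: out-degree per node (per-node data fits in memory).
    outdeg = [0] * n
    with open(edge_path) as f:
        for line in f:
            src, _ = map(int, line.split())
            outdeg[src] += 1

    score = [1.0 / n] * n                       # INITIALIZE-MEM
    for _ in range(passes):                     # iteration step
        nxt = [0.0] * n
        with open(edge_path) as f:              # one sequential pass over the edges
            for line in f:
                src, dest = map(int, line.split())
                nxt[dest] += alpha * score[src] / outdeg[src]   # COMPUTE
        # NORMALIZE: spread the teleportation and dead-end mass uniformly.
        leftover = 1.0 - sum(nxt)
        score = [v + leftover / n for v in nxt]
    return score                                # POST-PROCESS would go here

def demo():
    """Write a tiny edge file, run the computation, and clean up."""
    edges = [(0, 1), (1, 2), (2, 0), (3, 0)]
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for s, d in edges:
            f.write(f"{s} {d}\n")
        path = f.name
    try:
        return semi_streaming_pagerank(path, n=4)
    finally:
        os.remove(path)
```

The edges are never held in memory; only the `outdeg`, `score`, and `nxt` arrays are, which is the per-node budget the semi-streaming model allows.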
120. Link-Based Features
Degree-related measures
PageRank
TrustRank [Gyöngyi et al., 2004]
Truncated PageRank [Becchetti et al., 2006]
Estimation of supporters [Becchetti et al., 2006]
140 features per host (2 pages per host)
121. Degree-Based
[Figure: two histograms of degree-based features for Normal vs. Spam hosts; axis labels not recoverable]
122. TrustRank
TrustRank [Gyöngyi et al., 2004]
A node with high PageRank but far away from a core set of “trusted nodes” is suspicious
Start from a set of trusted nodes, then do a random walk, returning to the set of trusted nodes with probability 1 − α at each step
Trusted nodes: data from http://www.dmoz.org/
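The random walk described above can be sketched as a power iteration in which the restart mass always returns to the trusted set. This is a minimal in-memory sketch, not the cited implementation; the function name `trustrank`, the dictionary-based graph representation, and the default α = 0.85 are assumptions.

```python
def trustrank(out_links, trusted, alpha=0.85, iterations=50):
    """out_links: dict node -> list of successors; trusted: set of seed nodes."""
    nodes = set(out_links) | {d for dests in out_links.values() for d in dests}
    # Restart distribution: uniform over the trusted nodes, zero elsewhere.
    restart = {v: (1.0 / len(trusted) if v in trusted else 0.0) for v in nodes}
    trust = dict(restart)
    for _ in range(iterations):
        nxt = {v: 0.0 for v in nodes}
        for src in nodes:
            succs = out_links.get(src, [])
            for dest in succs:
                # Follow a link with probability alpha.
                nxt[dest] += alpha * trust[src] / len(succs)
        # The remaining mass (restart probability 1 - alpha, plus dead ends)
        # jumps back to the trusted set.
        leftover = 1.0 - sum(nxt.values())
        trust = {v: nxt[v] + leftover * restart[v] for v in nodes}
    return trust
```

Nodes that cannot be reached from the trusted set receive a score of zero, so a node with high PageRank but a low score here matches the "suspicious" pattern the slide describes.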
124. TrustRank Idea
125. TrustRank / PageRank
[Figure: distribution of TrustRank / PageRank for Normal vs. Spam]
126. High and low-ranked pages are different
[Figure: number of nodes (×10^4) vs. distance (1 to 20), with curves for pages ranked in the top 0%–10%, top 40%–50%, and top 60%–70%]
Areas below the curves are equal if we are in the same strongly-connected component
128. Probabilistic counting
[Figure: propagation of bits using the “OR” operation toward a target page; count the bits set to estimate supporters]
[Becchetti et al., 2006] shows an improvement of the ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
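The bit-propagation idea can be sketched with classic Flajolet-Martin bitmasks: each node starts with a few random masks, masks are OR-ed along in-links for d rounds, and the position of the lowest zero bit yields an estimate of how many distinct nodes contributed. This is an ANF-style sketch under stated assumptions, not the improved estimator of [Becchetti et al., 2006]; the function names, the choice of k = 32 masks, and the fixed seed are hypothetical.

```python
import random

PHI = 0.77351  # Flajolet-Martin correction constant

def random_mask(rng, bits=32):
    """Bitmask with a single bit set; bit i is chosen with probability 2^-(i+1)."""
    i = 0
    while i < bits - 1 and rng.random() < 0.5:
        i += 1
    return 1 << i

def lowest_zero_bit(mask):
    """Index of the least significant bit that is not set."""
    r = 0
    while mask & (1 << r):
        r += 1
    return r

def estimate_supporters(in_links, nodes, d, k=32, seed=0):
    """Estimate, for every node, how many nodes reach it in at most d links.

    in_links: dict node -> list of nodes linking to it. The count includes
    the node itself.
    """
    rng = random.Random(seed)
    masks = {v: [random_mask(rng) for _ in range(k)] for v in nodes}
    for _ in range(d):
        new = {}
        for v in nodes:
            merged = list(masks[v])
            for u in in_links.get(v, []):
                # Propagate bits with the OR operation.
                merged = [a | b for a, b in zip(merged, masks[u])]
            new[v] = merged
        masks = new
    return {
        v: 2 ** (sum(lowest_zero_bit(m) for m in masks[v]) / k) / PHI
        for v in nodes
    }
```

Each round needs only the masks of the previous round and one pass over the links, which is why the technique fits the semi-streaming setting.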
130. Bottleneck number
$b_d(x) = \min_{j \le d} \{ |N_j(x)| / |N_{j-1}(x)| \}$. Minimum rate of growth of the neighbors of x up to a certain distance. We expect that spam pages form clusters that are somehow isolated from the rest of the Web graph and they have smaller bottleneck numbers than non-spam pages.
[Figure: histogram of the bottleneck number for Normal vs. Spam]
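A small exact computation may help make the bottleneck number concrete. The sketch assumes that N_j(x) is the set of nodes that can reach x in at most j links, with N_0(x) = {x}; this reading and the function name are assumptions, and on a real Web graph the set sizes would be estimated by probabilistic counting rather than by explicit breadth-first search.

```python
def bottleneck_number(in_links, x, d):
    """Minimum growth ratio |N_j(x)| / |N_{j-1}(x)| for j = 1..d (d >= 1).

    in_links: dict node -> list of nodes linking to it.
    """
    reached = {x}      # N_0(x)
    frontier = {x}
    sizes = [1]
    for _ in range(d):
        # Nodes one more link away that have not been seen yet.
        frontier = {u for v in frontier for u in in_links.get(v, [])} - reached
        reached |= frontier
        sizes.append(len(reached))
    return min(sizes[j] / sizes[j - 1] for j in range(1, d + 1))
```

For example, if x has two in-neighbors and only one further node behind them, the neighborhood sizes are 1, 3, 4, 4, and the ratio drops to 1.0 at distance 3.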
132. Content-Based Features
Most of these reported in [Ntoulas et al., 2006]:
Number of words in the page and title
Average word length
Fraction of anchor text
Fraction of visible text
Compression rate
From [Castillo et al., 2007]:
Corpus precision and corpus recall
Query precision and query recall
Independent trigram likelihood
Entropy of trigrams
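A few of the listed features are simple enough to sketch directly. The definitions below are assumptions made for illustration (for instance, compression rate taken as uncompressed size divided by zlib-compressed size, and trigrams taken over whitespace-separated words); the cited papers may define them differently, and the function name is hypothetical.

```python
import math
import zlib
from collections import Counter

def content_features(text):
    """Compute a handful of content-based features for one page of text."""
    words = text.split()
    trigrams = list(zip(words, words[1:], words[2:]))
    counts = Counter(trigrams)
    total = len(trigrams)
    # Shannon entropy (in bits) of the word-trigram distribution.
    entropy = (
        -sum(c / total * math.log2(c / total) for c in counts.values())
        if total else 0.0
    )
    raw = text.encode("utf-8")
    return {
        "num_words": len(words),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        # Highly repetitive text compresses well, so this ratio grows.
        "compression_rate": len(raw) / len(zlib.compress(raw)) if raw else 0.0,
        "trigram_entropy": entropy,
    }
```

Repetitive keyword-stuffed text tends to show a high compression rate and low trigram entropy under these definitions, which is the intuition behind using them as spam signals.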
133. Average word length
Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.
134. Corpus precision
Figure: Histogram of the corpus precision in non-spam vs. spam pages.
135. Query precision
Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.
137. General hypothesis
Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages
Ideas for exploiting this: clustering, propagation, stacked learning
138. [Figure; source: Castillo et al., 2007]
139. Topological dependencies: in-links
Histogram of fraction of spam hosts in the in-links
0 = no in-link comes from spam hosts
1 = all of the in-links come from spam hosts
[Figure: histograms for in-links of non-spam vs. in-links of spam hosts]
140. Topological dependencies: out-links
Histogram of fraction of spam hosts in the out-links
0 = none of the out-links points to spam hosts
1 = all of the out-links point to spam hosts
[Figure: histograms for out-links of non-spam vs. out-links of spam hosts]
141. Idea 1: Clustering
Classify, then cluster hosts, then assign the same label to all hosts in the same cluster by majority voting
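The majority-voting step of Idea 1 can be sketched as follows, assuming the classifier outputs a spam probability per host and that a clustering of the host graph is already available. The function name, the 0.5 decision threshold, and the rule that ties go to "normal" are assumptions for illustration.

```python
from collections import defaultdict

def smooth_by_cluster(predictions, cluster_of):
    """predictions: host -> spam probability; cluster_of: host -> cluster id.

    Returns host -> label, where every host takes its cluster's majority label.
    """
    members = defaultdict(list)
    for host, cluster in cluster_of.items():
        members[cluster].append(host)
    labels = {}
    for hosts in members.values():
        spam_votes = sum(predictions[h] >= 0.5 for h in hosts)
        # Strict majority required; ties are resolved as "normal".
        label = "spam" if spam_votes > len(hosts) / 2 else "normal"
        for h in hosts:
            labels[h] = label
    return labels
```

A host that was misclassified in isolation gets corrected when most of its cluster disagrees with the initial prediction.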
142. Idea 1: Clustering (cont.)
Initial prediction: [figure]
143. Idea 1: Clustering (cont.)
Clustering: [figure]
144. Idea 1: Clustering (cont.)
Final prediction: [figure]
145. Idea 1: Clustering – Results
                      Baseline   Clustering
Without bagging
True positive rate    75.6%      74.5%
False positive rate   8.5%       6.8%
F-Measure             0.646      0.673
With bagging
True positive rate    78.7%      76.9%
False positive rate   5.7%       5.0%
F-Measure             0.723      0.728
✓ Reduces error rate
146. Idea 2: Propagate the label
Classify, then interpret “spamicity” as a probability, then do a random walk with restart from those nodes
147. Idea 2: Propagate the label (cont.)
Initial prediction: [figure]
148. Idea 2: Propagate the label (cont.)
Propagation: [figure]
149. Idea 2: Propagate the label (cont.)
Final prediction, applying a threshold: [figure]
150. Idea 2: Propagate the label – Results
                      Baseline   Forward   Backward   Both
Classifier without bagging
True positive rate    75.6%      70.9%     69.4%      71.4%
False positive rate   8.5%       6.1%      5.8%       5.8%
F-Measure             0.646      0.665     0.664      0.676
Classifier with bagging
True positive rate    78.7%      76.5%     75.0%      75.2%
False positive rate   5.7%       5.4%      4.3%       4.7%
F-Measure             0.723      0.716     0.733      0.724
151. Idea 3: Stacked graphical learning
Meta-learning scheme [Cohen and Kou, 2006]
Derive initial predictions
Generate an additional attribute for each object by combining predictions on neighbors in the graph
Append the additional attribute to the data and retrain
152. Idea 3: Stacked graphical learning (cont.)
Let p(x) ∈ [0, 1] be the prediction of a classification algorithm for a host x using k features
Let N(x) be the set of pages related to x (in some way)
Compute
$f(x) = \frac{\sum_{g \in N(x)} p(g)}{|N(x)|}$
Add f(x) as an extra feature for instance x and learn a new model with k + 1 features
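The feature-construction step can be sketched as below: given the first-stage predictions p and a neighbor relation N, it appends f(x) to each feature vector so that a second model can be trained on k + 1 features. The function name, the dictionary layout, and the fallback to p(x) when N(x) is empty are assumptions; the slides do not specify how empty neighborhoods are handled.

```python
def add_neighbor_feature(features, predictions, neighbors):
    """features: x -> list of k values; predictions: x -> p(x); neighbors: x -> N(x).

    Returns x -> list of k + 1 values, with f(x) appended as the last entry.
    """
    extended = {}
    for x, row in features.items():
        related = neighbors.get(x, [])
        if related:
            # f(x): average first-stage prediction over the related nodes.
            f_x = sum(predictions[g] for g in related) / len(related)
        else:
            # Assumption: with no neighbors, reuse the node's own prediction.
            f_x = predictions[x]
        extended[x] = row + [f_x]
    return extended
```

Retraining any classifier on the extended rows completes one pass; feeding the new predictions back into the same function gives a second pass.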
153. Idea 3: Stacked graphical learning (cont.)
Initial prediction: [figure]
154. Idea 3: Stacked graphical learning (cont.)
Computation of new feature: [figure]
155. Idea 3: Stacked graphical learning (cont.)
New prediction with k + 1 features: [figure]
156. Idea 3: Stacked graphical learning - Results
                      Baseline   Avg. of in   Avg. of out   Avg. of both
True positive rate    78.7%      84.4%        78.3%         85.2%
False positive rate   5.7%       6.7%         4.8%          6.1%
F-Measure             0.723      0.733        0.742         0.750
✓ Increases detection rate
157. Idea 3: Stacked graphical learning x2
And repeat ...
                      Baseline   First pass   Second pass
True positive rate    78.7%      85.2%        88.4%
False positive rate   5.7%       6.1%         6.3%
F-Measure             0.723      0.750        0.763
✓ Significant improvement over the baseline
159. Concluding remarks
Hypothesis: topical locality + link endorsement
Primitives: similarity, ranking, propagation, etc.
Application to Web spam
162. Thank you!
163. References
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006). Generalizing PageRank: Damping functions for link-based ranking algorithms. In Proceedings of ACM SIGIR, pages 308–315, Seattle, Washington, USA. ACM Press.
Baeza-Yates, R., Castillo, C., and Efthimiadis, E. (2007). Characterization of national Web domains. ACM Transactions on Internet Technology, 7(2).
Baeza-Yates, R. and Poblete, B. (2006). Dynamics of the Chilean Web structure. Comput. Networks, 50(10):1464–1473.
Baeza-Yates, R., Saint-Jean, F., and Castillo, C. (2002). Web structure, dynamics and page quality. In Proceedings of String Processing and Information Retrieval (SPIRE), volume 2476 of Lecture Notes in Computer Science, Lisbon, Portugal. Springer.
Barabási, A.-L. (2002). Linked: The New Science of Networks. Perseus Books Group.
Barabási, A.-L. and Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439):509–512.
164. References (cont.)
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. (2000). Graph structure in the Web: Experiments and models. In Proceedings of the Ninth Conference on World Wide Web, pages 309–320, Amsterdam, Netherlands. ACM Press.
Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., and Vigna, S. (2006). A reference collection for Web spam. SIGIR Forum, 40(2):11–24.
Castillo, C., Donato, D., Gionis, A., Murdock, V., and Silvestri, F. (2007). Know your neighbors: Web spam detection using the Web topology. In Proceedings of SIGIR, Amsterdam, Netherlands. ACM.
Chellapilla, K. and Maykov, A. (2007). A taxonomy of JavaScript redirection spam. In AIRWeb ’07: Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web, pages 81–88, New York, NY, USA. ACM Press.