1. Web Spam Detection
Know your Neighbors: Web Spam Detection Using the Web Topology
Carlos Castillo (1), Debora Donato (1), Aristides Gionis (1), Vanessa Murdock (1), Fabrizio Silvestri (2)
(1) Yahoo! Research Barcelona, Catalunya, Spain
(2) ISTI-CNR, Pisa, Italy
ACM SIGIR, 25 July 2007, Amsterdam
2. Web Spam Detection
1 Spam on the Web
2 Detecting Web Spam
3 Link-Based Detection
4 Content-Based Detection
5 Using Links and Contents
6 Using the Web Topology
7 Conclusions
4. What is on the Web?
5. What is on the Web [2.0]?
6. What else is on the Web?
Source: www.milliondollarhomepage.com
7. What's happening on the Web?
There is a fierce competition for your attention
8. What's happening on the Web?
Search engines are to some extent arbiters of this competition and they must watch it closely, otherwise ...
9. Some cheating occurs
1986 FIFA World Cup, Argentina vs England
10. Simple web spam
11. Hidden text
12. Made for advertising
13. Search engine?
14. Fake search engine
15. “Normal” content in link farms
16. There are many attempts at cheating on the Web
Most of these are spam:
1,630,000 results for “free mp3 hilton viagra” in SE1
1,760,000 results for “credit vicodin loan” in SE2
1,320,000 results for “porn mortgage” in SE3
17. Costs
Costs:
✗ Costs for users: lower precision for some queries
✗ Costs for search engines: wasted storage space, network resources, and processing cycles
✗ Costs for the publishers: resources invested in cheating and not in improving their contents
18. Cheating on the Web
Link spam
Content spam
Spam-oriented blogging
Comment/forum/Wiki spam
Malicious cloaking
Click fraud ×2
Malicious tagging
... more?
Adversarial relationship
Every undeserved gain in ranking for a spammer is a loss of precision for the search engine.
29. Research on Web spam detection
Web spam detection techniques
30. Spam, damn spam and statistics
[Fetterly et al., 2004] propose to study statistical distributions: “in a number of these distributions, outlier values are associated with web spam”
31. Machine learning training
32. Machine learning
33. Challenges
Scalability
+ Machine Learning Challenges:
    Instances are not really independent (graph)
    Training set is relatively small
+ Information Retrieval Challenges:
    It is hard to find out which features are relevant
    Features can be aggregated in content units: page/host/domain
    Features can be propagated through the graph
42. Training data
✗ It is hard for search engines to provide labeled data
✗ Even if they do, it will not reflect a consensus on what is Web Spam
✓ Public Web Spam collection built by a group of volunteers: http://www.yr-bcn.es/webspam/
46. “Link farms”
[Figure: diagram with the labels Web, Link farm, and Spam page]
Single-level farms can be detected by searching for groups of nodes sharing their out-links [Gibson et al., 2005]
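The idea of looking for nodes that share their out-links can be illustrated with a small sketch. This is not the method of [Gibson et al., 2005]; it is a simplified stand-in that groups nodes whose out-link sets are exactly identical. The function name, the `min_size` parameter, and the toy graph are all hypothetical.

```python
from collections import defaultdict

def groups_sharing_outlinks(out_links, min_size=3):
    """Group nodes whose out-link sets are identical (simplified illustration).

    out_links: dict mapping node -> iterable of destination nodes.
    Returns a list of sorted node groups with at least min_size members.
    """
    by_targets = defaultdict(list)
    for node, dests in out_links.items():
        key = frozenset(dests)
        if key:  # ignore nodes without out-links
            by_targets[key].append(node)
    return [sorted(nodes) for nodes in by_targets.values() if len(nodes) >= min_size]

# Toy graph (made up): a, b and c all link only to the same page.
graph = {
    "a": ["spam"], "b": ["spam"], "c": ["spam"],
    "d": ["news", "blog"], "spam": [],
}
print(groups_sharing_outlinks(graph))
```

On a real Web graph exact equality would be too strict; an approximate set-similarity measure would be the natural relaxation.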
47. Handling large graphs
Memory size enough to hold some data per node
Disk size enough to hold some data per edge
A small number of passes over the data
48. Semi-streaming model
for node : 1 ... N do
    INITIALIZE-MEM(node)
end for
for distance : 1 ... d do {Iteration step}
    for src : 1 ... N do {Follow links in the graph}
        for all links from src to dest do
            COMPUTE(src, dest)
        end for
    end for
    NORMALIZE
end for
POST-PROCESS
return Something
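As one possible instance of this skeleton, the sketch below computes PageRank with per-node state in memory and the edge list re-read sequentially on every pass. Choosing PageRank as the instance, the `read_edges` callback, and the parameter names are assumptions made for illustration; the POST-PROCESS step is omitted.

```python
def pagerank_semi_streaming(num_nodes, read_edges, passes=20, alpha=0.85):
    """read_edges() must return a fresh iterator of (src, dest) pairs,
    for example streamed from a file on disk (hypothetical interface)."""
    # INITIALIZE-MEM: only per-node data is kept in memory.
    out_degree = [0] * num_nodes
    for src, _ in read_edges():
        out_degree[src] += 1
    score = [1.0 / num_nodes] * num_nodes

    for _ in range(passes):                      # iteration step
        nxt = [0.0] * num_nodes
        for src, dest in read_edges():           # one sequential pass over the edges
            # COMPUTE(src, dest): push a share of src's score to dest.
            nxt[dest] += alpha * score[src] / out_degree[src]
        # NORMALIZE: spread the mass lost to damping and dangling nodes uniformly.
        missing = 1.0 - sum(nxt)
        score = [v + missing / num_nodes for v in nxt]
    return score

# Toy edge list (made up); node 3 has no in-links.
edges = [(0, 1), (1, 2), (2, 0), (3, 0)]
scores = pagerank_semi_streaming(4, lambda: iter(edges))
print(scores)
```

The memory footprint is a few numbers per node, while the edges are only ever read front to back, which matches the requirements listed on the previous slide.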
51. Link-based features
Degree-related measures
PageRank
TrustRank [Gyöngyi et al., 2004]
Truncated PageRank [Becchetti et al., 2006]
Estimation of supporters [Becchetti et al., 2006]
140 features per host (2 pages per host)
52. Degree-Based
[Figure: two histograms of degree-based features, Normal vs. Spam]
53. TrustRank
[Gyöngyi et al., 2004]
54. TrustRank / PageRank
[Figure: histogram of TrustRank / PageRank, Normal vs. Spam]
55. Truncated PageRank
Proposed in [Becchetti et al., 2006]. Idea: reduce the direct contribution of the first levels of links:
$$\text{damping}(t) = \begin{cases} 0 & t \le T \\ C\alpha^{t} & t > T \end{cases}$$
✓ No extra reading of the graph after PageRank
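A minimal sketch of how the damping function could be applied, assuming PageRank is written as a sum of contributions from paths of length t. The function name, the in-memory `out_links` layout, and the choice C = 1 - alpha (the usual PageRank constant, so the truncated score is left unnormalized) are assumptions, not details taken from [Becchetti et al., 2006].

```python
def truncated_pagerank(out_links, T=2, alpha=0.85, max_t=50):
    """out_links[u] is the list of nodes that u links to (hypothetical layout).

    Returns (pagerank, truncated): both accumulate the same per-level
    contributions, but `truncated` ignores levels t <= T.
    """
    n = len(out_links)
    C = 1.0 - alpha                # assumed normalization constant
    contrib = [C / n] * n          # contribution of paths of length t = 0
    pagerank = contrib[:]          # damping(t) = C * alpha**t for every t
    truncated = [0.0] * n          # damping(t) = 0 for t <= T
    for t in range(1, max_t + 1):
        nxt = [0.0] * n
        for u, dests in enumerate(out_links):
            for v in dests:
                nxt[v] += alpha * contrib[u] / len(dests)
        contrib = nxt
        for v in range(n):
            pagerank[v] += contrib[v]
            if t > T:
                truncated[v] += contrib[v]
    return pagerank, truncated

# Toy graph (made up): nodes 0 and 1 link to node 2, which has no other support.
pr, tr = truncated_pagerank([[2], [2], []], T=1)
print(pr, tr)
```

Both scores come out of the same loop over the links, which is one way to read the remark that no extra pass over the graph is needed. In the toy graph, node 2 is supported only at distance 1, so its truncated score vanishes while its PageRank does not.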
57. Hop-plot
58. High and low-ranked pages are different
[Figure: number of nodes (×10^4) vs. distance (1 to 20), with curves for Top 0%-10%, Top 40%-50%, and Top 60%-70%]
Areas below the curves are equal if we are in the same strongly-connected component
60. Probabilistic counting
[Figure: bit vectors are propagated along links to the target page using the “OR” operation; the bits set are counted to estimate supporters]
[Becchetti et al., 2006] shows an improvement of the ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
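The propagation of bits with OR can be sketched as follows. This uses the classic Flajolet-Martin estimator (position of the lowest unset bit, averaged over several sketches), not the improved estimator of [Becchetti et al., 2006]. All names, the number of sketches, and the toy graph are assumptions.

```python
import random

PHI = 0.77351  # correction factor of the Flajolet-Martin estimator

def fm_bitmask(rng, bits=32):
    """Bitmask with exactly one bit set; bit i is chosen with probability about 2**-(i+1)."""
    i = 0
    while i < bits - 1 and rng.random() < 0.5:
        i += 1
    return 1 << i

def lowest_zero_bit(mask):
    """Index of the least significant bit that is not set."""
    r = 0
    while mask & (1 << r):
        r += 1
    return r

def estimate_supporters(in_links, distance, sketches=64, seed=0):
    """in_links[v] lists the nodes linking to v.

    Returns, per node, an estimate of how many nodes (itself included)
    can reach it within `distance` links.
    """
    rng = random.Random(seed)
    n = len(in_links)
    masks = [[fm_bitmask(rng) for _ in range(sketches)] for _ in range(n)]
    for _ in range(distance):
        new = [row[:] for row in masks]
        for v in range(n):
            for u in in_links[v]:
                for k in range(sketches):
                    new[v][k] |= masks[u][k]   # propagate bits with OR
        masks = new
    estimates = []
    for v in range(n):
        mean_r = sum(lowest_zero_bit(m) for m in masks[v]) / sketches
        estimates.append(2 ** mean_r / PHI)
    return estimates

# Toy graph (made up): node 0 is linked from nodes 1..5.
toy_in_links = [[1, 2, 3, 4, 5], [], [], [], [], []]
est_d0 = estimate_supporters(toy_in_links, distance=0)
est_d1 = estimate_supporters(toy_in_links, distance=1)
print(est_d0[0], est_d1[0])
```

The estimates are noisy for small neighborhoods; the point of the technique is that each node needs only a fixed number of bits regardless of how many supporters it has.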
62. Bottleneck number
$$b_d(x) = \min_{j \le d} \frac{|N_j(x)|}{|N_{j-1}(x)|}$$
Minimum rate of growth of the neighbors of x up to a certain distance.
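A small sketch of the bottleneck number computed exactly with a breadth-first search over in-links. It assumes that N_j(x) is the set of nodes within distance j of x (so N_0(x) = {x}) and that d is at least 1; the slide does not spell out either point. Function and variable names are hypothetical.

```python
def bottleneck_number(in_links, x, d):
    """Minimum of |N_j(x)| / |N_{j-1}(x)| for j = 1..d.

    in_links: dict mapping node -> list of nodes that link to it.
    N_j(x) is taken to be all nodes within distance j of x (assumption).
    """
    seen = {x}
    frontier = {x}
    prev_size = 1
    best = float("inf")
    for _ in range(d):
        # Nodes first reached at this distance.
        frontier = {u for v in frontier for u in in_links.get(v, ()) if u not in seen}
        seen |= frontier
        best = min(best, len(seen) / prev_size)
        prev_size = len(seen)
    return best

# Toy graph (made up): x has three in-neighbors, only one of which has its own supporter.
toy = {"x": ["a", "b", "c"], "a": ["d"], "b": [], "c": []}
print(bottleneck_number(toy, "x", 2))  # min(4/1, 5/4) = 1.25
```

On a Web-scale graph the exact sets would be replaced by estimated sizes, for instance from probabilistic counting.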
63. Bottleneck number: spam
64. Bottleneck number: normal
65. Bottleneck number
$$b_d(x) = \min_{j \le d} \frac{|N_j(x)|}{|N_{j-1}(x)|}$$
[Figure: histogram of the bottleneck number, Normal vs. Spam]
67. Content-Based Features
Most of the features reported in [Ntoulas et al., 2006]
Number of words in the page and title
Average word length
Fraction of anchor text
Fraction of visible text
Compression rate
Corpus precision and corpus recall
Query precision and query recall
Independent trigram likelihood
Entropy of trigrams
96 features per host
68. Content-based features (entropy related)
$T = \{(w_1, p_1), \ldots, (w_k, p_k)\}$: the set of trigrams in a page, where trigram $w_i$ has frequency $p_i$
Features:
Entropy of trigrams $H = -\sum_{w_i \in T} p_i \log p_i$
Also, compression rate, as measured by bzip
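Both entropy-related features can be sketched with the Python standard library. Whitespace tokenization, the base-2 logarithm, and the definition of compression rate as uncompressed size divided by compressed size are assumptions; the slide only gives the entropy formula and names bzip.

```python
import bz2
import math
from collections import Counter

def trigram_entropy(text):
    """H = -sum(p_i * log2(p_i)) over the word trigrams of the text."""
    words = text.split()  # naive whitespace tokenization (assumption)
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    total = len(trigrams)
    if total == 0:
        return 0.0
    counts = Counter(trigrams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def compression_rate(text):
    """Uncompressed size divided by bzip2-compressed size (assumed definition)."""
    raw = text.encode("utf-8")
    return len(raw) / len(bz2.compress(raw))

print(trigram_entropy("a b c d e f"))               # four distinct trigrams
print(trigram_entropy("spam spam spam spam spam"))  # a single repeated trigram
print(compression_rate("buy cheap pills " * 200))   # highly repetitive text
```

Repetitive, keyword-stuffed text yields low trigram entropy and a high compression rate, which is why these values are useful as features.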
69. Content-based features (related to popular keywords)
F: set of most frequent terms in the collection
Q: set of most frequent terms in a query log
P: set of terms in a page
Features:
Corpus “precision” |P ∩ F|/|P|
Corpus “recall” |P ∩ F|/|F|
Query “precision” |P ∩ Q|/|P|
Query “recall” |P ∩ Q|/|Q|
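The four ratios share one computation, sketched below. The function name and the toy term sets are hypothetical; passing F gives the corpus features and passing Q gives the query features.

```python
def overlap_features(page_terms, frequent_terms):
    """Return ("precision", "recall") as defined on the slide:
    |P ∩ F| / |P| and |P ∩ F| / |F|."""
    P, F = set(page_terms), set(frequent_terms)
    common = len(P & F)
    precision = common / len(P) if P else 0.0
    recall = common / len(F) if F else 0.0
    return precision, recall

# Toy data (made up).
page = ["cheap", "viagra", "loan", "the"]
corpus_top_terms = ["the", "of", "and", "loan", "a"]
print(overlap_features(page, corpus_top_terms))  # (0.5, 0.4)
```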
70. Average word length
Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.
71. Corpus precision
Figure: Histogram of the corpus precision in non-spam vs. spam pages.
72. Query precision
Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.
74. Cost-sensitive decision tree with bagging
Bagging of 10 decision trees, asymmetrical costs.
Cost ratio            1       10      20      30      50
True positive rate    65.8%   66.7%   71.1%   78.7%   84.1%
False positive rate   2.8%    3.4%    4.5%    5.7%    8.6%
F-Measure             0.712   0.703   0.704   0.723   0.692
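To illustrate why a higher cost ratio raises both the true positive and the false positive rate, here is a sketch of the standard expected-cost decision rule. It is not the authors' classifier. It assumes the cost ratio R means that missing a spam host costs R times as much as flagging a normal host, and that a classifier outputs a spam probability p; both are assumptions.

```python
def spam_threshold(cost_ratio):
    """Probability above which predicting "spam" has the lower expected cost.

    Predicting spam costs (1 - p) * 1 in expectation, predicting normal
    costs p * cost_ratio, so spam wins when p > 1 / (1 + cost_ratio).
    """
    return 1.0 / (1.0 + cost_ratio)

def classify(p_spam, cost_ratio):
    return "spam" if p_spam > spam_threshold(cost_ratio) else "normal"

for ratio in (1, 10, 20, 30, 50):
    print(ratio, round(spam_threshold(ratio), 4), classify(0.1, ratio))
```

As the ratio grows the threshold drops, so more hosts are labeled spam: more true positives, but also more false positives, matching the trend in the table.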
75. Link- and content-based features
Link-based and content-based
                      Both    Link-only   Content-only
True positive rate    78.7%   79.4%       64.9%
False positive rate   5.7%    9.0%        3.7%
F-Measure             0.723   0.659       0.683
77. Hypothesis
Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages.
Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000]
80. Topological dependencies: in-links
Histogram of fraction of spam hosts in the in-links
0 = no in-link comes from spam hosts
1 = all of the in-links come from spam hosts

[Figure: histogram with the fraction of spam hosts in the in-links on the x-axis (0.0 to 1.0) and a y-axis from 0 to 0.4; two series: in-links of non-spam hosts, in-links of spam hosts]
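The quantity plotted in this histogram can be computed per host as sketched below. The edge list, the host names and the label dictionary are hypothetical. Hosts with no in-links are left out, which is an assumption, since the slide does not say how they are handled.

```python
def spam_fraction_of_inlinks(edges, is_spam):
    """Map each host with at least one in-link to the fraction of its
    in-linking hosts that are labeled spam.

    edges: iterable of (source_host, target_host) pairs.
    is_spam: dict mapping host -> bool.
    """
    in_neighbors = {}
    for src, dst in edges:
        in_neighbors.setdefault(dst, set()).add(src)
    return {
        host: sum(is_spam[s] for s in sources) / len(sources)
        for host, sources in in_neighbors.items()
    }


# Hypothetical host graph: "c" is linked only from normal hosts,
# "s2" is linked from two spam hosts and one normal host.
edges = [("a", "c"), ("b", "c"), ("s1", "s2"), ("s3", "s2"), ("a", "s2")]
is_spam = {"a": False, "b": False, "c": False, "s1": True, "s2": True, "s3": True}
fractions = spam_fraction_of_inlinks(edges, is_spam)
```

Swapping source and target in each edge gives the corresponding fraction for out-links.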
81. Topological dependencies: out-links
Histogram of fraction of spam hosts in the out-links
0 = none of the out-links points to spam hosts
1 = all of the out-links point to spam hosts

[Figure: histogram with the fraction of spam hosts in the out-links on the x-axis (0.0 to 1.0) and a y-axis from 0 to 1; two series: out-links of non-spam hosts, out-links of spam hosts]
82. Idea 1: Clustering
Classify, then cluster hosts, then assign the same label to all hosts in the same cluster by majority voting
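The relabeling step can be sketched as follows, assuming the classifier output and a cluster assignment are already available as dictionaries. The clustering algorithm itself is not shown, and the function name and data are hypothetical.

```python
from collections import Counter, defaultdict


def smooth_by_cluster(predicted, cluster_of):
    """Give every host the majority predicted label of its cluster.

    predicted: dict host -> "spam" or "nonspam" (initial classifier output).
    cluster_of: dict host -> cluster id.
    Ties go to whichever label Counter saw first (an arbitrary choice).
    """
    members = defaultdict(list)
    for host, cluster in cluster_of.items():
        members[cluster].append(host)

    smoothed = {}
    for hosts in members.values():
        majority, _count = Counter(predicted[h] for h in hosts).most_common(1)[0]
        for h in hosts:
            smoothed[h] = majority
    return smoothed


# Hypothetical data: h3 is predicted nonspam but sits in a mostly-spam cluster.
predicted = {"h1": "spam", "h2": "spam", "h3": "nonspam", "h4": "nonspam"}
cluster_of = {"h1": 0, "h2": 0, "h3": 0, "h4": 1}
smoothed = smooth_by_cluster(predicted, cluster_of)
```

A host whose initial label disagrees with most of its cluster is flipped, which is how the clustering step can correct isolated classifier errors.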
83. Idea 1: Clustering (cont.)
Initial prediction: [figure]
84. Idea 1: Clustering (cont.)
Clustering: [figure]
85. Idea 1: Clustering (cont.)
Final prediction: [figure]
86. Idea 1: Clustering – Results

                      Baseline  Clustering
Without bagging
True positive rate    75.6%     74.5%
False positive rate   8.5%      6.8%
F-Measure             0.646     0.673
With bagging
True positive rate    78.7%     76.9%
False positive rate   5.7%      5.0%
F-Measure             0.723     0.728

- Reduces error rate
87. Idea 2: Propagate the label
Classify, then interpret “spamicity” as a probability, then do a random walk with restart from those nodes
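A random walk with restart can be sketched with plain power iteration, as below. The restart distribution is taken proportional to the classifier's spamicity; the damping value, the iteration count and the handling of hosts without out-links are assumptions, not details given on the slide.

```python
def random_walk_with_restart(out_links, spamicity, damping=0.85, iterations=50):
    """Propagate spamicity over the host graph.

    out_links: dict host -> list of hosts it links to (all must be keys of spamicity).
    spamicity: dict host -> classifier score in [0, 1], with a positive total.
    Returns a dict host -> stationary score; the scores sum to 1.
    """
    hosts = list(spamicity)
    total = sum(spamicity.values())
    restart = {h: spamicity[h] / total for h in hosts}
    score = dict(restart)

    for _ in range(iterations):
        nxt = {h: (1.0 - damping) * restart[h] for h in hosts}
        for h in hosts:
            targets = out_links.get(h, [])
            if targets:
                share = damping * score[h] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Assumed handling of a host with no out-links:
                # return its mass to the restart distribution.
                for t in hosts:
                    nxt[t] += damping * score[h] * restart[t]
        score = nxt
    return score


# Hypothetical graph: likely-spam host "s" links to "n1"; "n2" is isolated.
spamicity = {"s": 0.9, "n1": 0.1, "n2": 0.1}
out_links = {"s": ["n1"]}
scores = random_walk_with_restart(out_links, spamicity)
```

Passing the reversed adjacency lists propagates along in-links instead of out-links. A final label would then come from thresholding the scores, with the threshold chosen separately.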
88. Idea 2: Propagate the label (cont.)
Initial prediction: [figure]
89. Idea 2: Propagate the label (cont.)
Propagation: [figure]
90. Idea 2: Propagate the label (cont.)
Final prediction, applying a threshold: [figure]
91. Idea 2: Propagate the label – Results

                      Baseline  Forward  Backward  Both
Classifier without bagging
True positive rate    75.6%     70.9%    69.4%     71.4%
False positive rate   8.5%      6.1%     5.8%      5.8%
F-Measure             0.646     0.665    0.664     0.676
Classifier with bagging
True positive rate    78.7%     76.5%    75.0%     75.2%
False positive rate   5.7%      5.4%     4.3%      4.7%
F-Measure             0.723     0.716    0.733     0.724
92. Idea 3: Stacked graphical learning
Meta-learning scheme [Cohen and Kou, 2006]
- Derive initial predictions
- Generate an additional attribute for each object by combining predictions on neighbors in the graph
- Append the additional attribute to the data and retrain
93. Idea 3: Stacked graphical learning (cont.)
- Let p(x) ∈ [0..1] be the prediction of a classification algorithm for a host x using k features
- Let N(x) be the set of pages related to x (in some way)
- Compute
  \[ f(x) = \frac{\sum_{g \in N(x)} p(g)}{|N(x)|} \]
- Add f(x) as an extra feature for instance x and learn a new model with k + 1 features
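The extra feature f(x) is the mean prediction over the neighbors of x, which can be sketched as below. The fallback to p(x) when N(x) is empty is an assumption (the slide does not cover that case), and the host names, feature values and helper names are hypothetical.

```python
def neighbor_average(p, neighbors):
    """Compute f(x) = sum of p(g) for g in N(x), divided by |N(x)|.

    p: dict host -> initial prediction in [0, 1].
    neighbors: dict host -> list of related hosts, i.e. N(x).
    Assumption: a host with an empty N(x) keeps its own prediction p(x).
    """
    f = {}
    for x in p:
        related = neighbors.get(x, [])
        f[x] = sum(p[g] for g in related) / len(related) if related else p[x]
    return f


def append_feature(features, f):
    """Return new feature rows with f(x) appended: k features become k + 1."""
    return {x: row + [f[x]] for x, row in features.items()}


# Hypothetical predictions and neighborhoods.
p = {"a": 0.9, "b": 0.8, "c": 0.1}
neighbors = {"c": ["a", "b"], "a": ["b"]}
f = neighbor_average(p, neighbors)

features = {"a": [1.0, 2.0], "b": [0.5, 0.1], "c": [3.0, 4.0]}  # k = 2
stacked = append_feature(features, f)                           # k + 1 = 3
```

The classifier is then retrained on the rows in `stacked`. Choosing N(x) as the in-links, the out-links or both gives different variants of the feature.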
94. Idea 3: Stacked graphical learning (cont.)
Initial prediction: [figure]
95. Idea 3: Stacked graphical learning (cont.)
Computation of new feature: [figure]
96. Idea 3: Stacked graphical learning (cont.)
New prediction with k + 1 features: [figure]
97. Idea 3: Stacked graphical learning – Results

                      Baseline  Avg. of in  Avg. of out  Avg. of both
True positive rate    78.7%     84.4%       78.3%        85.2%
False positive rate   5.7%      6.7%        4.8%         6.1%
F-Measure             0.723     0.733       0.742        0.750

- Increases detection rate
98. Idea 3: Stacked graphical learning x2
And repeat ...

                      Baseline  First pass  Second pass
True positive rate    78.7%     85.2%       88.4%
False positive rate   5.7%      6.1%        6.3%
F-Measure             0.723     0.750       0.763

- Significant improvement over the baseline
99. Web Spam Detection
1 Spam on the Web
2 Detecting Web Spam
3 Link-Based Detection
4 Content-Based Detection
5 Using Links and Contents
6 Using the Web Topology
7 Conclusions
100. Concluding remarks
- Considering content-based and link-based attributes improves the accuracy of the classifier
- Considering the links among pages improves the accuracy
102. Web Spam Detection
- Web Spam Dataset: http://www.yr-bcn.es/webspam/
- Web Spam Challenge I & II: http://webspam.lip6.fr/
- AIRWeb Workshop: http://airweb.cse.lehigh.edu/
- GraphLab at ECML/PKDD: http://graphlab.lip6.fr/
- Newsletter: webspam-announces@yahoogroups.com

Thank you!
104. Web Spam Detection
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.

Cohen, W. W. and Kou, Z. (2006). Stacked graphical learning: approximating learning in Markov random fields using very short inhomogeneous Markov chains. Technical report.

Davison, B. D. (2000). Topical locality in the web. In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, pages 272–279, Athens, Greece. ACM Press.

Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.

Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209.

105. Web Spam Detection
Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB ’05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.

Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating Web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.

Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.

Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.