Know your Neighbors: Web Spam Detection Using the Web Topology

Carlos Castillo (1), Debora Donato (1), Aristides Gionis (1), Vanessa Murdock (1), Fabrizio Silvestri (2)
1. Yahoo! Research Barcelona, Catalunya, Spain
2. ISTI-CNR, Pisa, Italy

ACM SIGIR, 25 July 2007, Amsterdam

1   Spam on the Web
2   Detecting Web Spam
3   Link-Based Detection
4   Content-Based Detection
5   Using Links and Contents
6   Using the Web Topology
7   Conclusions

1   Spam on the Web

What is on the Web?

What is on the Web [2.0]?

What else is on the Web?
    Source: www.milliondollarhomepage.com

What’s happening on the Web?
    There is a fierce competition for your attention.
    Search engines are to some extent arbiters of this competition, and they must watch it closely, otherwise ...

Some cheating occurs
    1986 FIFA World Cup, Argentina vs England

Simple web spam

Hidden text

Made for advertising

Search engine?

Fake search engine

“Normal” content in link farms

There are many attempts at cheating on the Web
    Most of these are spam:
        1,630,000 results for “free mp3 hilton viagra” in SE1
        1,760,000 results for “credit vicodin loan” in SE2
        1,320,000 results for “porn mortgage” in SE3

Costs
    Costs:
        ✗ Costs for users: lower precision for some queries
        ✗ Costs for search engines: wasted storage space, network resources, and processing cycles
        ✗ Costs for the publishers: resources invested in cheating and not in improving their contents

Cheating on the Web
    → Link spam
    → Content spam
      Spam-oriented blogging
      Comment/forum/Wiki spam
      Malicious cloaking
      Click fraud ×2
      Malicious tagging
      ... more?

    Adversarial relationship
    Every undeserved gain in ranking for a spammer is a loss of precision for the search engine.

2   Detecting Web Spam

Research on Web spam detection
    Web spam detection techniques

Spam, damn spam and statistics
    [Fetterly et al., 2004] propose to study statistical distributions: “in a number of these distributions, outlier values are associated with web spam”

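To make the outlier idea concrete, here is a minimal sketch that flags values lying far from the bulk of a distribution. It is an illustration only and not the procedure of [Fetterly et al., 2004]: the function name `flag_outliers`, the median-based score, and the threshold of 3.5 are all assumptions chosen for the example.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return the indices of values that sit far from the median.

    Uses a modified z-score based on the median absolute deviation (MAD).
    Both the score and the default threshold are illustrative assumptions.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median, so no meaningful score can be computed.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical per-host feature values (for instance, out-links per page).
feature_values = [10, 12, 11, 13, 12, 11, 10, 500]
suspects = flag_outliers(feature_values)
```

A flagged index is only a candidate for inspection; the quoted claim is that outliers are associated with spam, not that every outlier is spam.
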
Machine learning training

Machine learning

Challenges
    Scalability
    + Machine Learning Challenges:
        Instances are not really independent (graph)
        Training set is relatively small
    + Information Retrieval Challenges:
        It is hard to find out which features are relevant
        Features can be aggregated in content units: page/host/domain
        Features can be propagated through the graph

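As an illustration of aggregating features in content units, the sketch below rolls a page-level feature up to the host level. The function name `aggregate_by_host`, the example URLs, and the choice of mean and maximum as aggregates are assumptions made for this example; the slides do not say which aggregates were used.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def aggregate_by_host(page_features):
    """Map each host to (mean, max) of a page-level feature.

    page_features: dict mapping a page URL to a numeric feature value.
    """
    by_host = defaultdict(list)
    for url, value in page_features.items():
        # urlsplit(...).hostname gives the lower-cased host part of the URL.
        by_host[urlsplit(url).hostname].append(value)
    return {
        host: (sum(vals) / len(vals), max(vals))
        for host, vals in by_host.items()
    }


# Hypothetical page-level values for two hosts.
pages = {
    "http://a.example/1": 1.0,
    "http://a.example/2": 3.0,
    "http://b.example/": 5.0,
}
host_features = aggregate_by_host(pages)
```

The same grouping could be keyed on the registered domain instead of the host to obtain domain-level features.
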
Training data
    ✗ It is hard for search engines to provide labeled data
    ✗ Even if they do, it will not reflect a consensus on what is Web Spam
    ✓ Public Web Spam collection built by a group of volunteers: http://www.yr-bcn.es/webspam/

3   Link-Based Detection

“Link farms”
    [Diagram: the Web, a link farm, and a spam page]
    Single-level farms can be detected by searching for groups of nodes sharing their out-links [Gibson et al., 2005]

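The idea of looking for groups of nodes that share their out-links can be sketched as follows. This is a deliberately simplified version that only groups nodes whose out-link sets are identical; it is not the algorithm of [Gibson et al., 2005]. The function name, the `min_size` parameter, and the toy graph are assumptions made for the example.

```python
from collections import defaultdict


def groups_sharing_outlinks(out_links, min_size=3):
    """Return groups of nodes that point to exactly the same set of targets.

    out_links: dict mapping a node to an iterable of the nodes it links to.
    min_size:  smallest group worth reporting (hypothetical parameter).
    """
    groups = defaultdict(list)
    for node, targets in out_links.items():
        if targets:  # nodes without out-links carry no signal here
            groups[frozenset(targets)].append(node)
    return [sorted(nodes) for nodes in groups.values() if len(nodes) >= min_size]


# Toy graph: f1, f2, f3 all link only to "spam", like a single-level farm.
toy_graph = {
    "f1": ["spam"],
    "f2": ["spam"],
    "f3": ["spam"],
    "a": ["b", "c"],
    "b": ["a"],
}
candidate_farms = groups_sharing_outlinks(toy_graph)
```

Exact matching misses farms whose pages differ by a few links; a similarity measure over out-link sets would be needed to catch those.
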
Handling large graphs
    Memory size enough to hold some data per node
    Disk size enough to hold some data per edge
    A small number of passes over the data

Semi-streaming model
    for node : 1 ... N do
        INITIALIZE-MEM(node)
    end for
    for distance : 1 ... d do {Iteration step}
        for src : 1 ... N do {Follow links in the graph}
            for all links from src to dest do
                COMPUTE(src, dest)
            end for
        end for
        NORMALIZE
    end for
    POST-PROCESS
    return Something

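One way to instantiate the template above is ordinary PageRank: memory holds one number per node, and each iteration makes a single sequential pass over the edges. The sketch below is an assumption about how the template might be filled in, not code from the talk; the function name, the `edge_stream` callable (standing in for re-reading an edge file from disk), and the default parameters are hypothetical.

```python
def semi_streaming_pagerank(num_nodes, edge_stream, out_degree,
                            iterations=20, alpha=0.85):
    """PageRank written in the shape of the semi-streaming template.

    edge_stream: zero-argument callable returning a fresh iterator over
                 (src, dest) pairs, so every iteration re-reads the edges.
    out_degree:  list with the number of out-links of each node.
    """
    # INITIALIZE-MEM: one value per node is kept in memory.
    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):                    # iteration step
        nxt = [0.0] * num_nodes
        for src, dest in edge_stream():            # one pass over the edges
            nxt[dest] += rank[src] / out_degree[src]   # COMPUTE(src, dest)
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[n] for n in range(num_nodes) if out_degree[n] == 0)
        # NORMALIZE: apply damping so the scores keep summing to 1.
        rank = [(1 - alpha) / num_nodes + alpha * (v + dangling / num_nodes)
                for v in nxt]
    return rank


# Toy graph: nodes 0 and 1 both link to node 2.
edges = [(0, 2), (1, 2)]
scores = semi_streaming_pagerank(3, lambda: iter(edges), [1, 1, 0])
```

Only the `rank` and `nxt` arrays grow with the number of nodes; the edges are never held in memory at once, which matches the per-node memory and per-edge disk assumption of the model.
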
Link-based features
    Degree-related measures
    PageRank
    TrustRank [Gyöngyi et al., 2004]
    Truncated PageRank [Becchetti et al., 2006]
    Estimation of supporters [Becchetti et al., 2006]

    140 features per host (2 pages per host)

Degree-Based
    [Figure: two panels, each comparing a degree-based measure for Normal and Spam hosts]

TrustRank
    [Gyöngyi et al., 2004]

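The slide gives only the citation, so the following is a minimal sketch of the commonly described form of TrustRank: a PageRank-style propagation in which the random jump goes only to a hand-picked set of trusted seed pages. The function name, parameters, and toy graph are assumptions for the example and are not taken from the talk.

```python
def trustrank(num_nodes, edges, trusted_seeds, iterations=20, alpha=0.85):
    """Propagate trust from seed pages along out-links.

    edges:         list of (src, dest) pairs.
    trusted_seeds: nodes assumed to be known non-spam (hypothetical input).
    """
    out_degree = [0] * num_nodes
    for src, _ in edges:
        out_degree[src] += 1

    # The jump vector is non-zero only on the trusted seeds.
    seed = [0.0] * num_nodes
    for s in trusted_seeds:
        seed[s] = 1.0 / len(trusted_seeds)

    trust = seed[:]
    for _ in range(iterations):
        nxt = [(1 - alpha) * s for s in seed]
        for src, dest in edges:
            # Each page splits its trust evenly among the pages it links to.
            nxt[dest] += alpha * trust[src] / out_degree[src]
        trust = nxt
    return trust


# Toy graph: trusted page 0 -> 1 -> 2, and an unrelated pair 3 -> 4.
toy_edges = [(0, 1), (1, 2), (3, 4)]
trust_scores = trustrank(5, toy_edges, trusted_seeds=[0])
```

In the toy graph, trust decays along the chain starting at the seed, and pages 3 and 4 receive none because no trusted page links to them, directly or indirectly.
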
TrustRank / PageRank
    [Figure: TrustRank / PageRank, Normal vs. Spam]

Truncated PageRank
    Proposed in [Becchetti et al., 2006]. Idea: reduce the direct contribution of the first levels of links:

    \[
    \mathrm{damping}(t) =
    \begin{cases}
      0           & t \le T \\
      C\,\alpha^{t} & t > T
    \end{cases}
    \]

    ✓ No extra reading of the graph after PageRank

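The damping function above can be read as a weight on link paths of length t: paths of length T or less contribute nothing. A minimal sketch under that reading follows. The function name, the `max_len` cutoff, and the choice of C = (1 - α) / α^(T+1), which makes the damping values sum to 1, are assumptions for the example rather than details given on the slide.

```python
def truncated_pagerank(num_nodes, edges, T=2, alpha=0.85, max_len=50):
    """Sum the contributions of link paths longer than T hops.

    edges:   list of (src, dest) pairs.
    max_len: longest path length considered (hypothetical cutoff).
    """
    out_degree = [0] * num_nodes
    for src, _ in edges:
        out_degree[src] += 1

    # Assumed normalization so that the damping values for t > T sum to 1.
    C = (1 - alpha) / alpha ** (T + 1)

    contrib = [C / num_nodes] * num_nodes   # contribution of paths of length 0
    score = [0.0] * num_nodes
    for t in range(1, max_len + 1):
        nxt = [0.0] * num_nodes
        for src, dest in edges:
            nxt[dest] += alpha * contrib[src] / out_degree[src]
        contrib = nxt                       # now holds paths of length t
        if t > T:                           # damping(t) = 0 for t <= T
            score = [s + c for s, c in zip(score, contrib)]
    return score


# Toy graph: node 0 has three direct supporters; 4 -> 5 -> 6 -> 7 is a chain.
toy_edges = [(1, 0), (2, 0), (3, 0), (4, 5), (5, 6), (6, 7)]
scores = truncated_pagerank(8, toy_edges, T=1)
```

With T = 1, node 0 scores zero because all of its support arrives over single links, while nodes 6 and 7 keep a positive score from longer paths. The per-length contributions are the same quantities an iterative PageRank computation produces, which is consistent with the remark that no extra reading of the graph is needed.
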
Hop-plot

High and low-ranked pages are different
    [Figure: Number of Nodes (×10^4) versus Distance (1 to 20), with one curve each for pages ranked in the Top 0%-10%, Top 40%-50%, and Top 60%-70%]
    Areas below the curves are equal if we are in the same strongly-connected component

Probabilistic counting

[Figure: every page holds a bit vector; the bits are propagated along links to the target page using the "OR" operation, and the bits set at the target page are counted to estimate its supporters.]

[Becchetti et al., 2006] presents an improvement of the ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985].

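The propagate-then-count idea can be illustrated with a short sketch. It is a simplified variant written for this summary, not the algorithm of [Becchetti et al., 2006] or ANF: it assumes each of k bits is set independently with a fixed probability eps, and it inverts the expected fraction of set bits to obtain the estimate. All function names and default values are hypothetical.

```python
import math
import random


def init_bits(nodes, k=64, eps=0.02, seed=0):
    """Give every node a k-bit vector; each bit is set with probability eps."""
    rng = random.Random(seed)
    return {v: sum(1 << i for i in range(k) if rng.random() < eps) for v in nodes}


def propagate(bits, in_links, distance):
    """OR each node's vector with those of the nodes linking to it, `distance` times."""
    for _ in range(distance):
        nxt = dict(bits)
        for v, sources in in_links.items():
            for u in sources:
                nxt[v] |= bits[u]
        bits = nxt
    return bits


def estimate_supporters(vector, k=64, eps=0.02):
    """Invert E[fraction of bits set] = 1 - (1 - eps)^n to estimate n.

    The estimate includes the target page itself, because its own bits
    are part of the vector.
    """
    ones = bin(vector).count("1")
    if ones == k:
        return math.inf  # vector saturated: eps is too large for this count
    return math.log(1 - ones / k) / math.log(1 - eps)


# Hand-made vectors make the OR propagation easy to follow.
bits = {"a": 0b001, "b": 0b010, "t": 0b000}
chain = {"t": ["a"], "a": ["b"]}  # b -> a -> t
print(bin(propagate(bits, chain, 1)["t"]))  # only a's bit has arrived
print(bin(propagate(bits, chain, 2)["t"]))  # b's bit arrives after two steps
```

Each propagation step touches every link once, so supporters within distance d cost d passes over the graph regardless of how many pages are being scored.
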
Bottleneck number

\[
b_d(x) = \min_{j \le d} \frac{|N_j(x)|}{|N_{j-1}(x)|}
\]

Minimum rate of growth of the neighbors of x up to a certain distance.

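As a rough illustration of the bottleneck number, the sketch below computes it exactly with a breadth-first search over in-links. It assumes that N_j(x) is the set of pages that can reach x in at most j hops and that N_0(x) contains only x itself; the slides do not spell out this convention, and the function name is hypothetical.

```python
def bottleneck_number(in_links, x, d):
    """Return min over 1 <= j <= d of |N_j(x)| / |N_{j-1}(x)|.

    in_links maps a node to the list of nodes that link to it.
    Assumption: N_0(x) = {x}, so |N_0(x)| = 1.
    """
    seen = {x}
    frontier = [x]
    sizes = [1]
    for _ in range(d):
        nxt = []
        for v in frontier:
            for u in in_links.get(v, ()):
                if u not in seen:
                    seen.add(u)
                    nxt.append(u)
        frontier = nxt
        sizes.append(len(seen))  # |N_j(x)| after j expansion steps
    return min(sizes[j] / sizes[j - 1] for j in range(1, d + 1))


# x has two direct supporters, but only one more page at distance 2,
# so the neighborhood grows by 3x and then by only 4/3.
links = {"x": ["a", "b"], "a": ["c"], "b": []}
print(bottleneck_number(links, "x", 1))
print(bottleneck_number(links, "x", 2))
```

On a large graph the exact sets would be replaced by the probabilistic counts described earlier; the exact search is used here only to keep the example short.
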
Bottleneck number: spam

[Figure: image not preserved.]

Bottleneck number: normal

[Figure: image not preserved.]

Bottleneck number

\[
b_d(x) = \min_{j \le d} \frac{|N_j(x)|}{|N_{j-1}(x)|}
\]

[Figure: histogram of the bottleneck number for Normal vs. Spam; x-axis from 1.11 to 4.52, y-axis from 0.00 to 0.40.]

4   Content-Based Detection

Content-Based Features

Most of the features are those reported in [Ntoulas et al., 2006]:

    Number of words in the page and title
    Average word length
    Fraction of anchor text
    Fraction of visible text
    Compression rate
    Corpus precision and corpus recall
    Query precision and query recall
    Independent trigram likelihood
    Entropy of trigrams

96 features per host

Content-based features (entropy related)

T = {(w_1, p_1), ..., (w_k, p_k)} is the set of trigrams in a page, where trigram w_i has frequency p_i.

Features:

    Entropy of trigrams:

    \[
    H = -\sum_{w_i \in T} p_i \log p_i
    \]

    Also, compression rate, as measured by bzip

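Both entropy-related features are easy to sketch with the Python standard library. The version below makes several assumptions that the slides leave open: trigrams are taken over whitespace-separated words, the logarithm is base 2, and the compression rate is the ratio of raw size to bzip2-compressed size. The function names are hypothetical.

```python
import bz2
import math
from collections import Counter


def trigram_entropy(text):
    """H = -sum(p_i * log2(p_i)) over the word trigrams of the text."""
    words = text.split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 0.0
    total = len(trigrams)
    counts = Counter(trigrams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def compression_ratio(text):
    """Raw size divided by bzip2-compressed size (an assumed definition)."""
    raw = text.encode("utf-8")
    return len(raw) / len(bz2.compress(raw))


print(trigram_entropy("a b c d"))          # two distinct, equally likely trigrams
print(trigram_entropy("a a a a a"))        # one repeated trigram
print(compression_ratio("spam " * 1000) > 1)
```

Highly repetitive text yields low trigram entropy and a high compression ratio under these definitions, so the two features capture related signals.
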
Content-based features (related to popular keywords)

F: set of most frequent terms in the collection
Q: set of most frequent terms in a query log
P: set of terms in a page

Features:

    Corpus "precision"   |P ∩ F| / |P|
    Corpus "recall"      |P ∩ F| / |F|
    Query "precision"    |P ∩ Q| / |P|
    Query "recall"       |P ∩ Q| / |Q|

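The four ratios above reduce to set intersections. The sketch below is a direct transcription under one added assumption: a ratio with an empty denominator is reported as 0.0. The function name, the dictionary keys, and the example terms are hypothetical.

```python
def keyword_features(page_terms, frequent_corpus_terms, frequent_query_terms):
    """Compute corpus/query "precision" and "recall" for one page."""
    P = set(page_terms)
    F = set(frequent_corpus_terms)
    Q = set(frequent_query_terms)

    def ratio(numerator, denominator):
        # Assumption: an empty denominator yields 0.0 instead of an error.
        return len(numerator) / len(denominator) if denominator else 0.0

    return {
        "corpus_precision": ratio(P & F, P),
        "corpus_recall": ratio(P & F, F),
        "query_precision": ratio(P & Q, P),
        "query_recall": ratio(P & Q, Q),
    }


features = keyword_features(
    page_terms=["cheap", "pills", "the", "of"],
    frequent_corpus_terms=["the", "of", "and", "a"],
    frequent_query_terms=["cheap", "pills", "hotel"],
)
print(features)
```
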
Average word length

[Figure: histogram with series Normal and Spam; x-axis from 3.0 to 7.5, y-axis from 0.00 to 0.12.]

Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.

Corpus precision

[Figure: histogram with series Normal and Spam; x-axis from 0.0 to 0.7, y-axis from 0.00 to 0.10.]

Figure: Histogram of the corpus precision in non-spam vs. spam pages.

Query precision

[Figure: histogram with series Normal and Spam; x-axis from 0.0 to 0.6, y-axis from 0.00 to 0.12.]

Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.

5   Using Links and Contents

Cost-sensitive decision tree with bagging

Bagging of 10 decision trees, asymmetrical costs.

Cost ratio             1       10      20      30      50
True positive rate     65.8%   66.7%   71.1%   78.7%   84.1%
False positive rate    2.8%    3.4%    4.5%    5.7%    8.6%
F-Measure              0.712   0.703   0.704   0.723   0.692

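One way to see why a larger cost ratio raises both the true positive and the false positive rate is a minimum-expected-cost decision over the votes of the bagged trees. The sketch below is an assumption made for illustration, not the authors' training procedure: it treats the fraction of trees voting "spam" as a probability, charges a cost of 1 for flagging a normal host and `cost_ratio` for missing a spam host, and picks the cheaper label. The function name is hypothetical.

```python
def bagged_spam_decision(tree_votes, cost_ratio):
    """Label a host as spam when that has the lower expected cost.

    tree_votes: one boolean per bagged tree (True = spam).
    cost_ratio: assumed cost of missing a spam host, relative to a cost
                of 1 for wrongly flagging a normal host.
    """
    p_spam = sum(tree_votes) / len(tree_votes)
    cost_if_labeled_normal = p_spam * cost_ratio
    cost_if_labeled_spam = (1 - p_spam) * 1.0
    return cost_if_labeled_normal > cost_if_labeled_spam


votes = [True] * 3 + [False] * 7  # 3 of 10 trees say "spam"
print(bagged_spam_decision(votes, cost_ratio=1))   # plain majority vote
print(bagged_spam_decision(votes, cost_ratio=10))  # misses are 10x as costly
```

Under this rule the decision threshold on the vote fraction is 1 / (1 + cost_ratio), so raising the ratio flags more hosts as spam, which matches the direction of the rates in the table.
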
Link- and content-based features

Link-based and content-based

                       Both     Link-only   Content-only
True positive rate     78.7%    79.4%       64.9%
False positive rate    5.7%     9.0%        3.7%
F-Measure              0.723    0.659       0.683

6   Using the Web Topology

Hypothesis

Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages.

Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000].

Topological dependencies: in-links

Histogram of fraction of spam hosts in the in-links

    0 = no in-link comes from spam hosts
    1 = all of the in-links come from spam hosts

[Figure: histogram with series "In-links of non spam" and "In-links of spam"; x-axis from 0.0 to 1.0, y-axis from 0 to 0.4.]

Topological dependencies: out-links

Histogram of fraction of spam hosts in the out-links

    0 = none of the out-links points to spam hosts
    1 = all of the out-links point to spam hosts

[Figure: histogram with series "Out-links of non spam" and "Out-links of spam"; x-axis from 0.0 to 1.0, y-axis from 0 to 1.]

Idea 1: Clustering

Classify, then cluster hosts, then assign the same label to all hosts in the same cluster by majority voting.

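The relabeling step can be sketched as follows, assuming the classifier output and the cluster assignment are already available as dictionaries keyed by host. How the clusters are produced is outside this sketch, ties are broken by whichever label was seen first in the cluster, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def smooth_by_cluster(predicted, cluster_of):
    """Give every host the majority label of its cluster.

    predicted:  host -> label from the initial classifier
    cluster_of: host -> cluster identifier
    """
    votes = defaultdict(Counter)
    for host, label in predicted.items():
        votes[cluster_of[host]][label] += 1
    return {
        host: votes[cluster_of[host]].most_common(1)[0][0]
        for host in predicted
    }


predicted = {"a": "spam", "b": "spam", "c": "normal", "d": "normal"}
cluster_of = {"a": 0, "b": 0, "c": 0, "d": 1}
print(smooth_by_cluster(predicted, cluster_of))
```

In the example, host "c" is outvoted by the two spam hosts in its cluster, while host "d" keeps its label because it is alone in its cluster.
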
Idea 1: Clustering (cont.)

Initial prediction: [Figure: image not preserved.]

Clustering: [Figure: image not preserved.]

Final prediction: [Figure: image not preserved.]

Idea 1: Clustering - Results

                       Baseline   Clustering
Without bagging
True positive rate     75.6%      74.5%
False positive rate    8.5%       6.8%
F-Measure              0.646      0.673
With bagging
True positive rate     78.7%      76.9%
False positive rate    5.7%       5.0%
F-Measure              0.723      0.728

✓ Reduces error rate

Idea 2: Propagate the label

Classify, then interpret "spamicity" as a probability, then do a random walk with restart from those nodes.

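A random walk with restart of this kind can be sketched with a few lines of power iteration. The sketch rests on assumptions not stated in the slides: the restart distribution is the normalized spamicity, the walk follows out-links (forward propagation), dangling nodes return their mass to the restart distribution, and alpha = 0.85. The function name is hypothetical.

```python
def propagate_spamicity(out_links, spamicity, alpha=0.85, iters=50):
    """Random walk with restart, restarting in proportion to spamicity.

    out_links: node -> list of nodes it links to (every node is a key)
    spamicity: node -> classifier output in [0, 1], not all zero
    """
    total = sum(spamicity.values())
    restart = {v: spamicity[v] / total for v in out_links}
    score = dict(restart)
    for _ in range(iters):
        nxt = {v: (1 - alpha) * restart[v] for v in out_links}
        for u, successors in out_links.items():
            if successors:
                share = alpha * score[u] / len(successors)
                for v in successors:
                    nxt[v] += share
            else:
                # Dangling node: return its mass to the restart distribution.
                for v in out_links:
                    nxt[v] += alpha * score[u] * restart[v]
        score = nxt
    return score


graph = {"s": ["a"], "a": [], "n": ["b"], "b": []}
spamicity = {"s": 0.9, "a": 0.1, "n": 0.1, "b": 0.1}
scores = propagate_spamicity(graph, spamicity)
print(scores["a"] > scores["b"])  # "a" is linked from the likely-spam host
```

Backward propagation would run the same function on the transposed graph, and a final label could be obtained by comparing each score against a threshold chosen on held-out data.
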
Idea 2: Propagate the label (cont.)

Initial prediction: [Figure: image not preserved.]

Propagation: [Figure: image not preserved.]

Final prediction, applying a threshold: [Figure: image not preserved.]

Idea 2: Propagate the label - Results

                       Baseline   Fwds.    Backwds.   Both
Classifier without bagging
True positive rate     75.6%      70.9%    69.4%      71.4%
False positive rate    8.5%       6.1%     5.8%       5.8%
F-Measure              0.646      0.665    0.664      0.676
Classifier with bagging
True positive rate     78.7%      76.5%    75.0%      75.2%
False positive rate    5.7%       5.4%     4.3%       4.7%
F-Measure              0.723      0.716    0.733      0.724

Idea 3: Stacked graphical learning

    Meta-learning scheme [Cohen and Kou, 2006]
    Derive initial predictions
    Generate an additional attribute for each object by combining predictions on neighbors in the graph
    Append the additional attribute to the data and retrain

Idea 3: Stacked graphical learning (cont.)

    Let p(x) ∈ [0, 1] be the prediction of a classification algorithm for a host x using k features
    Let N(x) be the set of pages related to x (in some way)
    Compute

    \[
    f(x) = \frac{\sum_{g \in N(x)} p(g)}{|N(x)|}
    \]

    Add f(x) as an extra feature for instance x and learn a new model with k + 1 features

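The feature computation above can be sketched directly. The sketch assumes predictions and neighbor lists are held in dictionaries keyed by host, and it assigns f(x) = 0.0 to a host with no neighbors, a case the formula leaves undefined. The function names are hypothetical, and the retraining step is left to whatever learner is in use.

```python
def stacked_feature(p, neighbors):
    """f(x) = average of p(g) over the neighbors g in N(x)."""
    f = {}
    for x in p:
        related = neighbors.get(x, [])
        # Assumption: hosts without neighbors get 0.0.
        f[x] = sum(p[g] for g in related) / len(related) if related else 0.0
    return f


def append_feature(features, f):
    """Extend each k-dimensional feature vector to k + 1 dimensions."""
    return {x: row + [f[x]] for x, row in features.items()}


p = {"a": 0.9, "b": 0.1, "c": 0.5}
neighbors = {"a": ["b", "c"], "b": ["a"]}
features = {"a": [1.0, 2.0], "b": [0.0, 3.0], "c": [4.0, 5.0]}

f = stacked_feature(p, neighbors)
extended = append_feature(features, f)
print(f)
print(extended["b"])  # original two features plus the neighbor average
```

A second pass repeats the same two calls using the predictions of the retrained model as the new p.
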
Idea 3: Stacked graphical learning (cont.)

Initial prediction: [Figure: image not preserved.]

Computation of new feature: [Figure: image not preserved.]

New prediction with k + 1 features: [Figure: image not preserved.]

Idea 3: Stacked graphical learning - Results

                       Baseline   Avg. of in   Avg. of out   Avg. of both
True positive rate     78.7%      84.4%        78.3%         85.2%
False positive rate    5.7%       6.7%         4.8%          6.1%
F-Measure              0.723      0.733        0.742         0.750

✓ Increases detection rate

Idea 3: Stacked graphical learning x2

And repeat ...

                       Baseline   First pass   Second pass
True positive rate     78.7%      85.2%        88.4%
False positive rate    5.7%       6.1%         6.3%
F-Measure              0.723      0.750        0.763

✓ Significant improvement over the baseline

7   Conclusions

Concluding remarks

✓ Considering content-based and link-based attributes improves the accuracy of the classifier
✓ Considering the links among pages improves the accuracy

Web Spam Dataset: http://www.yr-bcn.es/webspam/
Web Spam Challenge I & II: http://webspam.lip6.fr/
AIRWeb Workshop: http://airweb.cse.lehigh.edu/
GraphLab at ECML/PKDD: http://graphlab.lip6.fr/

Newsletter: webspam-announces@yahoogroups.com

Thank you!

Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.

Cohen, W. W. and Kou, Z. (2006). Stacked graphical learning: approximating learning in markov random fields using very short inhomogeneous markov chains. Technical report.

Davison, B. D. (2000). Topical locality in the web. In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, pages 272–279, Athens, Greece. ACM Press.

Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.

Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209.

Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB '05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.

Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating Web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.

Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.

Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.

Using Topology to Identify Spam (SIGIR 2007)

  • 1. Know your Neighbors: Web Spam Detection Using the Web Topology. Carlos Castillo (1), Debora Donato (1), Aristides Gionis (1), Vanessa Murdock (1), Fabrizio Silvestri (2). (1) Yahoo! Research Barcelona, Catalunya, Spain; (2) ISTI-CNR, Pisa, Italy. ACM SIGIR, 25 July 2007, Amsterdam
  • 2. Outline: 1 Spam on the Web; 2 Detecting Web Spam; 3 Link-Based Detection; 4 Content-Based Detection; 5 Using Links and Contents; 6 Using the Web Topology; 7 Conclusions
  • 4. What is on the Web?
  • 5. What is on the Web [2.0]?
  • 6. What else is on the Web? Source: www.milliondollarhomepage.com
  • 7. What’s happening on the Web? There is a fierce competition for your attention
  • 8. What’s happening on the Web? Search engines are to some extent arbiters of this competition and they must watch it closely, otherwise ...
  • 9. Some cheating occurs. 1986 FIFA World Cup, Argentina vs England
  • 10. Simple web spam
  • 11. Hidden text
  • 12. Made for advertising
  • 13. Search engine?
  • 14. Fake search engine
  • 15. “Normal” content in link farms
  • 16. There are many attempts at cheating on the Web. Most of these are spam: 1,630,000 results for “free mp3 hilton viagra” in SE1; 1,760,000 results for “credit vicodin loan” in SE2; 1,320,000 results for “porn mortgage” in SE3
  • 17. Costs. For users: lower precision for some queries. For search engines: wasted storage space, network resources, and processing cycles. For the publishers: resources invested in cheating and not in improving their contents
  • 18. Cheating on the Web: link spam; content spam; spam-oriented blogging; comment/forum/Wiki spam; malicious cloaking; click fraud ×2; malicious tagging; ... more? Adversarial relationship: every undeserved gain in ranking for a spammer is a loss of precision for the search engine.
  • 29. Research on Web spam detection: Web spam detection techniques
  • 30. Spam, damn spam and statistics. [Fetterly et al., 2004] propose to study statistical distributions: “in a number of these distributions, outlier values are associated with web spam”
  • 31. Machine learning training
  • 32. Machine learning
  • 33. Challenges. Scalability. Machine learning challenges: instances are not really independent (graph); the training set is relatively small. Information retrieval challenges: it is hard to find out which features are relevant; features can be aggregated in content units (page/host/domain); features can be propagated through the graph
  • 42. Training data. It is hard for search engines to provide labeled data. Even if they do, it will not reflect a consensus on what is Web spam. A public Web spam collection was built by a group of volunteers: http://www.yr-bcn.es/webspam/
  • 46. “Link farms” [Figure labels: Web, Link farm, Spam page]. Single-level farms can be detected by searching for groups of nodes sharing their out-links [Gibson et al., 2005]
  • 47. Handling large graphs. Memory size enough to hold some data per node; disk size enough to hold some data per edge; a small number of passes over the data
  • 48. Semi-streaming model
        for node : 1 ... N do
            INITIALIZE-MEM(node)
        end for
        for distance : 1 ... d do {Iteration step}
            for src : 1 ... N do {Follow links in the graph}
                for all links from src to dest do
                    COMPUTE(src, dest)
                end for
            end for
            NORMALIZE
        end for
        POST-PROCESS
        return Something
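The semi-streaming template on slide 48 can be sketched in Python. This is a minimal illustration, not the authors' implementation: it instantiates COMPUTE and NORMALIZE as a plain PageRank iteration, and the names `semi_streaming_pagerank`, `read_edges` and `toy_edges` are hypothetical. It assumes every node has at least one out-link, and it keeps only per-node arrays in memory while the edges are re-read sequentially on each pass.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Edge = Tuple[int, int]  # (src, dest)


def semi_streaming_pagerank(
    num_nodes: int,
    read_edges: Callable[[], Iterable[Edge]],
    passes: int,
    alpha: float = 0.85,
) -> List[float]:
    # INITIALIZE-MEM: one score and one out-degree counter per node.
    score = [1.0 / num_nodes] * num_nodes
    out_degree = [0] * num_nodes
    for src, _dest in read_edges():  # one extra pass to count out-links
        out_degree[src] += 1

    for _ in range(passes):  # iteration step
        incoming = [0.0] * num_nodes
        for src, dest in read_edges():  # follow links, streamed sequentially
            # COMPUTE(src, dest): push a share of src's score to dest.
            incoming[dest] += score[src] / out_degree[src]
        # NORMALIZE: mix with the uniform jump (assumes no dangling nodes).
        score = [(1.0 - alpha) / num_nodes + alpha * s for s in incoming]
    return score  # POST-PROCESS is a no-op in this sketch


def toy_edges() -> Iterator[Edge]:
    # Hypothetical stand-in for a sequential scan of an edge file on disk.
    yield from [(0, 1), (0, 2), (1, 2), (2, 0)]


ranks = semi_streaming_pagerank(3, toy_edges, passes=50)
```

Memory use grows with the number of nodes only; the edge list is never held in memory, which matches the constraints listed on slide 47.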
  • 51. Link-based features: degree-related measures; PageRank; TrustRank [Gyöngyi et al., 2004]; Truncated PageRank [Becchetti et al., 2006]; estimation of supporters [Becchetti et al., 2006]. 140 features per host (2 pages per host)
  • 52. Degree-Based [Figure: two histograms comparing Normal and Spam]
  • 53. TrustRank [Gyöngyi et al., 2004]
  • 54. TrustRank / PageRank [Figure: histogram comparing Normal and Spam]
  • 55. Truncated PageRank. Proposed in [Becchetti et al., 2006]. Idea: reduce the direct contribution of the first levels of links: $\mathrm{damping}(t) = \begin{cases} 0 & t \le T \\ C\alpha^{t} & t > T \end{cases}$. Advantage: no extra reading of the graph after PageRank
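The damping function of Truncated PageRank can be written out as a short sketch. The slide does not define C; the code below assumes it is a normalization constant chosen so that the weights over all t > T sum to 1, which gives C = (1 - α) / α^(T+1). The function name `truncated_damping` and the default α = 0.85 are illustrative choices, not taken from the slides.

```python
def truncated_damping(t: int, T: int, alpha: float = 0.85) -> float:
    """Weight given to paths of length t; the first T levels are ignored."""
    if t <= T:
        return 0.0
    # Assumed normalization: makes the weights for t > T sum to 1.
    c = (1.0 - alpha) / alpha ** (T + 1)
    return c * alpha ** t


# Weights for path lengths 0..499 with the first two link levels truncated.
weights = [truncated_damping(t, T=2) for t in range(500)]
```

With T = 2 the first non-zero weight is at t = 3 and equals 1 - α, so the truncated weights are the ordinary PageRank weights shifted past the first T levels.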
  • 57. Hop-plot
  • 58. High and low-ranked pages are different [Figure: number of nodes (×10^4) versus distance (1 to 20) for pages ranked in the top 0%–10%, top 40%–50%, and top 60%–70%]. Areas below the curves are equal if we are in the same strongly-connected component
  • 60. Probabilistic counting. Figure: bit vectors are propagated toward the target page using the “OR” operation; the bits set are then counted to estimate supporters. [Becchetti et al., 2006] shows an improvement of the ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985].
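The OR-propagation step can be sketched as follows. This is a simplified illustration rather than the algorithm of [Becchetti et al., 2006]: it assumes each of k bits is set independently with probability `eps`, and it uses the estimator log(1 - ones/k) / log(1 - eps), which follows from that assumption. All function names are hypothetical.

```python
import math
import random

def init_masks(nodes, k, eps, rng):
    """Give every node a k-bit vector; each bit is set with probability eps."""
    return {v: sum(1 << i for i in range(k) if rng.random() < eps) for v in nodes}

def propagate(in_links, masks, steps):
    """OR each node's vector with those of its in-neighbors, `steps` times.

    After d steps a node holds the OR of the vectors of every node that can
    reach it in at most d links (itself included).
    """
    current = dict(masks)
    for _ in range(steps):
        nxt = {}
        for v, m in current.items():
            for u in in_links.get(v, ()):
                m |= current[u]
            nxt[v] = m
        current = nxt
    return current

def estimate_supporters(mask, k, eps):
    """Estimate how many vectors were OR-ed together from the bits set.

    Assumption: a bit stays 0 with probability (1 - eps)**n for n supporters.
    """
    ones = bin(mask).count("1")
    if ones >= k:
        return float("inf")   # saturated vector: eps was too large
    return math.log(1 - ones / k) / math.log(1 - eps)
```

A single pass with one value of `eps` only resolves a limited range of neighborhood sizes, since the vector saturates once the number of supporters is much larger than 1/eps.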
  • 62. Bottleneck number. $b_d(x) = \min_{j \le d} \frac{|N_j(x)|}{|N_{j-1}(x)|}$. Minimum rate of growth of the neighbors of x up to a certain distance.
  • 63. Bottleneck number: spam (figure not reproduced).
  • 64. Bottleneck number: normal (figure not reproduced).
  • 65. Bottleneck number. $b_d(x) = \min_{j \le d} \{|N_j(x)| / |N_{j-1}(x)|\}$. Figure: histogram of the bottleneck number for normal vs. spam, over values from 1.11 to 4.52.
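A minimal sketch of the bottleneck number, computed exactly with a breadth-first search. It assumes that N_j(x) is the set of nodes that can reach x in at most j links, with x itself counted, so that N_0(x) is never empty; the slides do not spell this out. The function names are illustrative.

```python
from collections import deque

def neighborhood_sizes(in_links, x, d):
    """Return [|N_0(x)|, ..., |N_d(x)|], following in-links backwards from x."""
    dist = {x: 0}
    queue = deque([x])
    while queue:
        v = queue.popleft()
        if dist[v] == d:
            continue                      # do not explore beyond distance d
        for u in in_links.get(v, ()):
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    sizes = [0] * (d + 1)
    for dv in dist.values():
        sizes[dv] += 1                    # nodes at distance exactly dv
    for j in range(1, d + 1):
        sizes[j] += sizes[j - 1]          # make the counts cumulative
    return sizes

def bottleneck_number(in_links, x, d):
    """Minimum growth ratio |N_j(x)| / |N_{j-1}(x)| over 1 <= j <= d."""
    sizes = neighborhood_sizes(in_links, x, d)
    return min(sizes[j] / sizes[j - 1] for j in range(1, d + 1))
```

On a large graph the exact search would be replaced by the probabilistic counting estimates described earlier.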
  • 67. Content-Based Features. Most of the features are those reported in [Ntoulas et al., 2006]: number of words in the page and title; average word length; fraction of anchor text; fraction of visible text; compression rate; corpus precision and corpus recall; query precision and query recall; independent trigram likelihood; entropy of trigrams. 96 features per host.
  • 68. Content-based features (entropy related). Let $T = \{(w_1, p_1), \ldots, (w_k, p_k)\}$ be the set of trigrams in a page, where trigram $w_i$ has frequency $p_i$. Features: entropy of trigrams, $H = -\sum_{w_i \in T} p_i \log p_i$; also, compression rate, as measured by bzip.
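Both features can be sketched in a few lines. The sketch makes several assumptions that the slide does not state: trigrams are taken over consecutive words, the logarithm is base 2, and the compression rate is the compressed size divided by the original size, computed with Python's `bz2` module as a stand-in for bzip.

```python
import bz2
import math
from collections import Counter

def trigram_entropy(words):
    """H = -sum(p_i * log2(p_i)) over the distinct word trigrams of a page."""
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 0.0
    counts = Counter(trigrams)
    total = len(trigrams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def compression_rate(text):
    """Compressed size over original size (assumed definition); lower = more redundant."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0
    return len(bz2.compress(raw)) / len(raw)
```

Highly repetitive text, such as a page stuffed with the same keywords, would yield both a low entropy and a low compression rate under these definitions.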
  • 69. Content-based features (related to popular keywords). F: set of most frequent terms in the collection; Q: set of most frequent terms in a query log; P: set of terms in a page. Features: corpus “precision” $|P \cap F| / |P|$; corpus “recall” $|P \cap F| / |F|$; query “precision” $|P \cap Q| / |P|$; query “recall” $|P \cap Q| / |Q|$.
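The four ratios share one computation, sketched below with plain sets. The function name and the handling of empty sets (returning 0.0) are assumptions made for the example.

```python
def overlap_features(page_terms, frequent_terms):
    """Return ("precision", "recall") of a page against a set of popular terms.

    precision = |P ∩ F| / |P|, recall = |P ∩ F| / |F|.
    Pass the collection's frequent terms for the corpus features, or the
    query log's frequent terms for the query features.
    """
    P, F = set(page_terms), set(frequent_terms)
    common = len(P & F)
    precision = common / len(P) if P else 0.0   # empty-set fallback is an assumption
    recall = common / len(F) if F else 0.0
    return precision, recall
```

For instance, a page with terms {cheap, pills, buy, now} measured against frequent terms {buy, cheap, free} has precision 2/4 and recall 2/3.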
  • 70. Average word length. Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.
  • 71. Corpus precision. Figure: Histogram of the corpus precision in non-spam vs. spam pages.
  • 72. Query precision. Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.
  • 74. Cost-sensitive decision tree with bagging. Bagging of 10 decision trees, asymmetrical costs.
      Cost ratio            1      10     20     30     50
      True positive rate    65.8%  66.7%  71.1%  78.7%  84.1%
      False positive rate   2.8%   3.4%   4.5%   5.7%   8.6%
      F-Measure             0.712  0.703  0.704  0.723  0.692
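One way to see why a higher cost ratio raises both the true positive and false positive rates is the decision rule sketched below. It is only an illustration of asymmetrical costs: it assumes the cost ratio is the cost of missing a spam host relative to flagging a normal one, and that the ensemble's spam probability is the fraction of trees voting spam. How the authors applied the costs during training is not stated on the slide.

```python
def bagged_probability(votes):
    """Fraction of the bagged trees that vote 'spam' (votes are 0 or 1)."""
    return sum(votes) / len(votes)

def cost_sensitive_label(p_spam, cost_ratio):
    """Predict spam when saying 'normal' has the larger expected cost.

    Equivalent to thresholding p_spam at 1 / (1 + cost_ratio).
    """
    cost_if_normal = p_spam * cost_ratio   # expected cost of missing spam
    cost_if_spam = (1.0 - p_spam) * 1.0    # expected cost of a false alarm
    return cost_if_normal > cost_if_spam
```

With a cost ratio of 1, a host is flagged only when more than half of the trees vote spam; with a ratio of 20, a single vote out of ten is enough.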
  • 75. Link- and content-based features. Link-based and content-based:
                            Both    Link-only  Content-only
      True positive rate    78.7%   79.4%      64.9%
      False positive rate   5.7%    9.0%       3.7%
      F-Measure             0.723   0.659      0.683
  • 77. Hypothesis. Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages. Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000].
  • 80. Topological dependencies: in-links. Figure: histogram of the fraction of spam hosts in the in-links, for in-links of non-spam vs. in-links of spam (0 = no in-link comes from spam hosts; 1 = all of the in-links come from spam hosts).
  • 81. Topological dependencies: out-links. Figure: histogram of the fraction of spam hosts in the out-links, for out-links of non-spam vs. out-links of spam (0 = none of the out-links points to spam hosts; 1 = all of the out-links point to spam hosts).
  • 82. Idea 1: Clustering. Classify, then cluster hosts, then assign the same label to all hosts in the same cluster by majority voting.
  • 83. Idea 1: Clustering (cont.). Initial prediction (figure not reproduced).
  • 84. Idea 1: Clustering (cont.). Clustering (figure not reproduced).
  • 85. Idea 1: Clustering (cont.). Final prediction (figure not reproduced).
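The three steps of Idea 1 can be sketched as a relabeling pass over precomputed clusters. The clustering algorithm itself is outside the sketch; the cluster assignment is taken as given. The tie rule (hosts keep their initial prediction) and the function name are assumptions.

```python
from collections import defaultdict

def smooth_by_cluster(predicted_spam, cluster_of):
    """Relabel every host with the majority label of its cluster.

    predicted_spam: host -> bool, the classifier's initial prediction.
    cluster_of:     host -> cluster id, produced by any graph clustering.
    """
    members = defaultdict(list)
    for host, cluster in cluster_of.items():
        members[cluster].append(host)
    smoothed = {}
    for hosts in members.values():
        spam_votes = sum(1 for h in hosts if predicted_spam[h])
        for h in hosts:
            if 2 * spam_votes == len(hosts):
                smoothed[h] = predicted_spam[h]        # tie: keep initial label (assumption)
            else:
                smoothed[h] = 2 * spam_votes > len(hosts)
    return smoothed
```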
  • 86. Idea 1: Clustering – Results.
                              Baseline  Clustering
      Without bagging
        True positive rate    75.6%     74.5%
        False positive rate   8.5%      6.8%
        F-Measure             0.646     0.673
      With bagging
        True positive rate    78.7%     76.9%
        False positive rate   5.7%      5.0%
        F-Measure             0.723     0.728
      ✓ Reduces error rate
  • 87. Idea 2: Propagate the label. Classify, then interpret “spamicity” as a probability, then do a random walk with restart from those nodes.
  • 88. Idea 2: Propagate the label (cont.). Initial prediction (figure not reproduced).
  • 89. Idea 2: Propagate the label (cont.). Propagation (figure not reproduced).
  • 90. Idea 2: Propagate the label (cont.). Final prediction, applying a threshold (figure not reproduced).
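A minimal sketch of the propagation step, written as a random walk with restart. Several choices are assumptions not given on the slides: the restart distribution is the normalized spamicity, the restart probability is 1 - `alpha`, dangling nodes return their mass to the restart distribution, and the threshold value is left to the caller. Passing the reversed graph gives backward propagation.

```python
def propagate_spamicity(out_links, spamicity, alpha=0.85, iterations=100):
    """Random walk with restart; restarts land on hosts in proportion to spamicity.

    out_links maps every host to the list of hosts it links to.
    """
    nodes = list(out_links)
    total = sum(spamicity[v] for v in nodes)
    if total == 0:
        return {v: 0.0 for v in nodes}       # nothing to propagate
    restart = {v: spamicity[v] / total for v in nodes}
    score = dict(restart)
    for _ in range(iterations):
        nxt = {v: (1 - alpha) * restart[v] for v in nodes}
        for u in nodes:
            targets = out_links[u]
            if targets:
                share = alpha * score[u] / len(targets)
                for v in targets:
                    nxt[v] += share
            else:
                # Dangling host: send its mass back to the restart set (assumption).
                for v in nodes:
                    nxt[v] += alpha * score[u] * restart[v]
        score = nxt
    return score

def apply_threshold(score, threshold):
    """Final prediction: hosts whose propagated score reaches the threshold."""
    return {v: s >= threshold for v, s in score.items()}
```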
  • 91. Idea 2: Propagate the label – Results.
                              Baseline  Forward  Backward  Both
      Classifier without bagging
        True positive rate    75.6%     70.9%    69.4%     71.4%
        False positive rate   8.5%      6.1%     5.8%      5.8%
        F-Measure             0.646     0.665    0.664     0.676
      Classifier with bagging
        True positive rate    78.7%     76.5%    75.0%     75.2%
        False positive rate   5.7%      5.4%     4.3%      4.7%
        F-Measure             0.723     0.716    0.733     0.724
  • 92. Idea 3: Stacked graphical learning. Meta-learning scheme [Cohen and Kou, 2006]: (1) derive initial predictions; (2) generate an additional attribute for each object by combining predictions on neighbors in the graph; (3) append the additional attribute to the data and retrain.
  • 93. Idea 3: Stacked graphical learning (cont.). Let $p(x) \in [0, 1]$ be the prediction of a classification algorithm for a host x using k features. Let N(x) be the set of pages related to x (in some way). Compute $f(x) = \frac{\sum_{g \in N(x)} p(g)}{|N(x)|}$. Add f(x) as an extra feature for instance x and learn a new model with k + 1 features.
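The feature f(x) is a plain average over neighbors, as the sketch below shows. The classifier that produces p(x) and the retraining step are left out; the function names and the `default` value used when N(x) is empty are assumptions.

```python
def neighbor_average(predictions, neighbors, default=0.0):
    """f(x) = sum of p(g) for g in N(x), divided by |N(x)|.

    predictions: host -> p(x) in [0, 1] from the first-pass classifier.
    neighbors:   host -> list of related hosts (in-links, out-links, or both).
    """
    feature = {}
    for x in predictions:
        related = neighbors.get(x, ())
        if related:
            feature[x] = sum(predictions[g] for g in related) / len(related)
        else:
            feature[x] = default          # no neighbors: assumed fallback
    return feature

def append_feature(rows, feature):
    """Turn each k-feature row into a (k + 1)-feature row for retraining."""
    return {x: list(values) + [feature[x]] for x, values in rows.items()}
```

Running the same two functions again on the predictions of the retrained model gives a second pass.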
  • 94. Idea 3: Stacked graphical learning (cont.). Initial prediction (figure not reproduced).
  • 95. Idea 3: Stacked graphical learning (cont.). Computation of new feature (figure not reproduced).
  • 96. Idea 3: Stacked graphical learning (cont.). New prediction with k + 1 features (figure not reproduced).
  • 97. Idea 3: Stacked graphical learning - Results.
                            Baseline  Avg. of in  Avg. of out  Avg. of both
      True positive rate    78.7%     84.4%       78.3%        85.2%
      False positive rate   5.7%      6.7%        4.8%         6.1%
      F-Measure             0.723     0.733       0.742        0.750
      ✓ Increases detection rate
  • 98. Idea 3: Stacked graphical learning x2. And repeat ...
                            Baseline  First pass  Second pass
      True positive rate    78.7%     85.2%       88.4%
      False positive rate   5.7%      6.1%        6.3%
      F-Measure             0.723     0.750       0.763
      ✓ Significant improvement over the baseline
  • 100. Concluding remarks. ✓ Considering content-based and link-based attributes improves the accuracy of the classifier. ✓ Considering the links among pages improves the accuracy.
  • 102. Thank you!
      Web Spam Dataset: http://www.yr-bcn.es/webspam/
      Web Spam Challenge I & II: http://webspam.lip6.fr/
      AIRWeb Workshop: http://airweb.cse.lehigh.edu/
      GraphLab at ECML/PKDD: http://graphlab.lip6.fr/
      Newsletter: webspam-announces@yahoogroups.com
  • 104. References.
      Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
      Cohen, W. W. and Kou, Z. (2006). Stacked graphical learning: approximating learning in Markov random fields using very short inhomogeneous Markov chains. Technical report.
      Davison, B. D. (2000). Topical locality in the web. In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, pages 272–279, Athens, Greece. ACM Press.
      Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
      Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209.
  • 105. References (cont.).
      Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB ’05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
      Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating Web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
      Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.
      Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.