SlideShare a Scribd company logo
1 of 5
Download to read offline
International Journal of Engineering Inventions
e-ISSN: 2278-7461, p-ISSN: 2319-6491
Volume 4, Issue 1 (July 2014) PP: 55-59
www.ijeijournal.com Page | 55
Page Rank Link Farm Detection
Akshay Saxena1
, Rohit Nigam2
1, 2
Department Of Information and Communication Technology (ICT) Manipal University, Manipal - 576014,
Karnataka, India
Abstract: The PageRank algorithm is an important algorithm which is implemented to determine the quality of
a page on the web. With search engines attaining a high position in guiding the traffic on the internet,
PageRank is an important factor to determine its flow. Since link analysis is used in search engine’s ranking
systems, link based spam structure known as link farms are created by spammers to generate a high PageRank
for their and in turn a target page. In this paper, we suggest a method through which these structures can be
detected and thus the overall ranking results can be improved.
I. Introduction
1.1 Page Rank
We will be working primarily on PageRank[1]
algorithm developed by Larry Page of Google. This
algorithm determines the importance of a website by counting the number and quality of links to a page,
assuming that an important website will have more links pointing to it by other websites.
This technique is used in Google's search process to rank websites of a search result.
It is not the only algorithm used by Google to order search engine results, but it is the first algorithm
that was used by the company, and it is the best-known. The counting and indexing of Web pages is done
through various programs like webcrawlers[1]
and other bot programs. By assigning a numerical weighting to
each link, it repeatedly applies the PageRank algorithm to get a more accurate result and determine the relative
rank of a Web page within a set pages.
The set of Web pages being considered is thought of as a directed graph with each node representing a
page and each edge as a link between pages. We determine the number of links pointing to a page by traversing
the graph and use those values in the PageRank formula.
Pagerank is quite prone to manipulation. Our goal was to find an effective way to ignore links from
pages that are trying to falsify a Pagerank. These spamming pages usually occur as a group called link farms
described below.
1.2 Complete Graph
In the mathematical field of graph theory, a complete graph[2]
is a simple undirected graph in which
every pair of distinct vertices is connected by a unique edge. A complete digraph is a directed graph in which
every pair of distinct vertices is connected by a pair of unique edges (one in each direction).
1.3 Clique
In the mathematical area of graph theory, a clique[2]
an undirected graph is a subset of its vertices such
that every two vertices in the subset are connected by an edge. Cliques are one of the basic concepts of graph
theory and are used in many other mathematical problems and constructions on graphs. Cliques have also been
studied in computer science: the task of finding whether there is a clique of a given size in a graph (the clique
problem) is NP-complete, but despite this hardness result many algorithms for finding cliques have been
studied.
FIG 1.1
Page Rank Link Farm Detection
www.ijeijournal.com Page | 56
1.4 Link Farm
On the World Wide Web, a link farm[3]
is any group of web sites that all hyperlink to every other site in
the group. In graph theoretic terms, a link farm is a clique. Although some link farms can be created by hand,
most are created through automated programs and services. A link farm is a form of spamming the index of a
search engine (sometimes called spamdexing or spamexing). Other link exchange systems are designed to allow
individual websites to selectively exchange links with other relevant websites and are not considered a form of
spamdexing.
FIG 1.2 Normal Graph of Web Pages FIG 1.3 Link Farm of 1, 2, 3 & 4
II. Existing System
Academic citation literature has been applied to the web, largely by counting citations or backlinks to a
given page. This gives some approximation of a page's importance or quality. PageRank extends this idea by not
counting links from all pages equally, and by normalizing by the number of links on a page. PageRank is
defined as follows:
We assume page A has pages T1...Tn which point to it (i.e., are inbound links). The parameter d is a
damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the
next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is
given as follows:
PR(A) = (1-d)/N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Note that the PageRank values form a probability distribution over web pages, so the sum of all web
pages' PageRank values will be one.
PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the
principal eigenvector of the normalized link matrix of the web. Also, a PageRank for 26 million web pages can
be computed in a few hours on a medium size workstation
III. Previous Work
Link spamming is one of the web spamming techniques that try to mislead link-based ranking
algorithms such as PageRank and HITS. Since these algorithms consider a link to pages as an endorsement for
that page, spammers create numerous false links and construct an artificially interlinked link structure, so called
a spam farm, to centralize link-based importance to their own spam pages.
To understand the web spamming, Gyöngyi et al[4]
described various web spamming techniques in.
Optimal link structures to boost PageRank scores are also studied to grasp the behavior of web spammers.
Fetterly[5]
found out that outliers in statistical distributions are very likely to be spam by analyzing statistical
properties of linkage, URL, host resolutions and contents of pages. To demote link spam, Gyöngyi et
alintroduced TrustRank[4]
that is a biased PageRank where rank scores start to be propagated from a seed set of
good pages through outgoing links. By this, we can expect spam pages to get low rank. Optimizing the link
structure is another approach to demote link spam.
Carvalho[6]
proposed the idea of noisy links, a link structure that has a negative impact on link-based
ranking algorithms. Qi et al. also estimated the quality of links by similarity of two pages. To detect link spam,
Bencz´ur introduced SpamRank[7]
. SpamRank checks PageRank score distributions of all in-neighbors of a
target page. If this distribution is abnormal, SpamRank regards a target page as a spam and penalizes it.
Becchetti[8]
employed link-based features for the link spam detection. They built a link spam classifier
with several features of the link structure like degrees, link-based ranking scores, and characteristics of out-
Page Rank Link Farm Detection
www.ijeijournal.com Page | 57
neighbors. Saito[9]
employed a graph algorithm to detect link spam. They decomposed the Web graph into
strongly connected components and discovered that large components are spam with high probability. Link
farms in the core were extracted by maximal clique enumeration.
IV. Proposed System
The PageRank algorithm, as described in the previous section, works upon a given set of pages found
via search using Web spiders or other programs. Now to rank these pages, we consider the links of each page
and apply the PageRank algorithm to it. However, the page rank algorithm can be easily fooled by link farms.
Link Farms work as follows - We know that the PageRank is higher for pages with more links pointing to it i.e.
more the in-line, higher the rank. A link farm forms a complete graph. This means, we have a large set of pages
pointing to each other, thus increasing the number of in-links and out-links even they are not relevant, hence
they falsify the Page Rank values obtained. On applying the algorithm to this graph, an interesting fact came to
notice that the Page Rank values of all the pages in a link farm was the same and remained so even after
multiple iterations of the algorithm. One could easily use this fact to identify link farms and remove such sets of
pages from our whole set. But the problem occurs when a page which is not a part of the link farm and is
actually relevant, it happens to have the same or nearly same PageRank as that of the link farm pages. So if we
consider pages with same page ranks for elimination, some relevant pages might also get discarded
unintentionally.
To solve this problem, we introduce a new factor we call Gap Rank. GapRank is based on PageRank
and in a way inverse of it. While PageRank is based on inbound links of a page, GapRank is calculated using
outbound links of pages.
Gaprank Formula
GR(A) = (1-d)/N + d (GR(T1)/L(T1) + ... + GR(Tn)/L(Tn))
where A is the page under consideration, T1,T2….Tn the set of pages that link from A i.e. outbound
pages,L(Tn) is the number of inbound links on page Tn, and N is the total number of pages.
The purpose of this formula is to rank the set of pages based on the number of outbound links. We then use the
GapRank to identify the Link Farm. Since all the pages in a link farm have same number of inbound links and
outbound links, the GapRank of these pages will be equal, just like their PageRanks. This can now be used to
differentiate a link farm from a page having almost same PageRank as those of link farm pages, but it's
GapRank will be different.
This analysis can further be used to reduce the ranks of these spam pages or eliminate them altogether
so that we get proper ranks for pages that suffered because of the link farm.
V. Methodology
Consider a given dataset consisting of 11 pages P1, P2,..,P11. The links between the pages can be seen in the
FIG 5.1, depicting a graph of nodes for each page and edges for links between them.
[FIG 5.1 Graph of initial dataset]
Page Rank Link Farm Detection
www.ijeijournal.com Page | 58
5.1 Calculating Pagerank
Initial PageRank to be given to all pages will be 1/n, where n is the size of dataset, i.e., 1/11. Set the
initial damping factor. In our example we take damping factor as 0.85.
Calculate the PageRank values of all the pages according to the formula :
PR(A) = (1-d)/N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where PR represents the PageRank of the represented page, C represents the total number of out bound links of
the page and d represents the damping factor.
For the given dataset, we obtain the following PageRank values after 30 iterations (Table 5.1.1):
Page Name PageRank
P1 0.0557838
P2 0.0426136
P3 0.0426136
P4 0.0426136
P5 0.0426136
P6 0.0426136
P7 0.0194318
P8 0.0359489
P9 0.0136364
P10 0.0136364
P11 0.049858
[Table 5.1.1 PageRank values of all pages in the dataset.]
We consider the PageRank values obtained after 30 iterations of PageRank calculation for all pages as
final because variations in the values after this will be negligible. Negligible changes reflect final values allotted
to each page.
5.2 Searching For Duplicate Pagerank Values
Scan the result for all the duplicate PageRank values using the best scanning technique available. The
pages with duplicate values are shown in Table 5.2.1. These pages may or may not belong to the spam group
(link farm) as some pages may have the same value as a spam page just by mere coincidence. To eliminate these
pages from the suspected pages, we perform the next step.
Page Name PageRank
P2 0.0426136
P3 0.0426136
P4 0.0426136
P5 0.0426136
P6 0.0426136
[Table 5.2.1 Pages having duplicate values]
5.3 Calculating Gaprank Values
Gap Rank values for the selected pages are calculated in the same way as PageRank values, by using
the formula shown in fig 4.1. The Gap Rank values obtained are shown in the Table 5.3.1
Page Name GapRank
P2 0.106365
P3 0.106365
P4 0.106365
P5 0.106365
P6 0.106365
[Table 5.2.1 GapRank values of pages having duplicate PageRank values ]
The pages having the same Gap Rank and PageRank values are identified as spam pages which
constitute the link farm.
Page Rank Link Farm Detection
www.ijeijournal.com Page | 59
5.4 Remove Link Farms From The Dataset
The pages constituting the link farm are removed from the data set and a new dataset is obtained
without the spam pages. The PageRank values of the pages in the new dataset are calculated from the beginning
with new initial PageRank values for all pages determined by 1/n’, where n’ is the new dataset size. The new
PageRank values are shown in the table 5.4.1 given below.
Page Name PageRank
P1 0.122724
P7 0.04275
P8 0.0790875
P9 0.03
P10 0.03
[Table5.4.1 New PageRank Values for non-spam pages]
REFERENCES
[1]. Sergey Brin and Lawrence Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine
[2]. Richard D. Alba: A graph-theoretic definition of a Sociometric Clique
[3]. Baoning Wu and Brian D. Davison:Identifying Link Farm Spam Pages
[4]. Z. Gy¨ongyi and H. Garcia-Molina. Web spamtaxonomy. In Proceedings of the 1st internationalworkshop on Adversarial
information retrieval on theWeb, 2005.
[5]. D. Fetterly, M. Manasse and M. Najork. Spam, damnspam, and statistics: using statistical analysis tolocate spam Web pages. In
Proceedings of the 7th
International Workshop on the Web and Databases,2004.
[6]. A. Carvalho, P. Chirita, E. Moura and P. Calado. Sitelevel noise removal for search engines. In Proceedingsof the 15th international
conference on World WideWeb. 2006
[7]. A. A. Bencz´ur, K Csalog´any, T Sarl´os and M. Uher.SpamRank-fully automatic link spam detection. InProceedings of the 1st
international workshop onAdversarial information retrieval on the Web, 2005
[8]. L. Becchetti, C. Castillo, D. Donato, S. Leonardi andR. Baeza-Yates. Link-based characterization anddetection of Web spam. In
Proceedings of the 2nd
international workshop on Adversarial informationretrieval on the Web, 2006.
[9]. H. Saito, M. Toyoda, M. Kitsuregawa and K. Aihara: A large-scale study of link spam detection by graphalgorithms In Proceedings
of the 3rd internationalworkshop on Adversarial information retrieval on theWeb, 2007.

More Related Content

What's hot (17)

Ranking Web Pages
Ranking Web PagesRanking Web Pages
Ranking Web Pages
 
Done reread deeperinsidepagerank
Done reread deeperinsidepagerankDone reread deeperinsidepagerank
Done reread deeperinsidepagerank
 
Deeper Inside PageRank (NOTES)
Deeper Inside PageRank (NOTES)Deeper Inside PageRank (NOTES)
Deeper Inside PageRank (NOTES)
 
PageRank and Markov Chain
PageRank and Markov ChainPageRank and Markov Chain
PageRank and Markov Chain
 
J046045558
J046045558J046045558
J046045558
 
CSE509 Lecture 3
CSE509 Lecture 3CSE509 Lecture 3
CSE509 Lecture 3
 
Efficient focused web crawling approach
Efficient focused web crawling approachEfficient focused web crawling approach
Efficient focused web crawling approach
 
JPJ1423 Keyword Query Routing
JPJ1423   Keyword Query RoutingJPJ1423   Keyword Query Routing
JPJ1423 Keyword Query Routing
 
Sree saranya
Sree saranyaSree saranya
Sree saranya
 
WEB PAGE RANKING BASED ON TEXT SUBSTANCE OF LINKED PAGES
WEB PAGE RANKING BASED ON TEXT SUBSTANCE OF LINKED PAGESWEB PAGE RANKING BASED ON TEXT SUBSTANCE OF LINKED PAGES
WEB PAGE RANKING BASED ON TEXT SUBSTANCE OF LINKED PAGES
 
keyword query routing
keyword query routingkeyword query routing
keyword query routing
 
Extracting Resources that Help Tell Events' Stories
Extracting Resources that Help Tell Events' StoriesExtracting Resources that Help Tell Events' Stories
Extracting Resources that Help Tell Events' Stories
 
IEEE 2014 JAVA DATA MINING PROJECTS Keyword query routing
IEEE 2014 JAVA DATA MINING PROJECTS Keyword query routingIEEE 2014 JAVA DATA MINING PROJECTS Keyword query routing
IEEE 2014 JAVA DATA MINING PROJECTS Keyword query routing
 
Keyword query routing
Keyword query routingKeyword query routing
Keyword query routing
 
Topic-specific Web Crawler using Probability Method
Topic-specific Web Crawler using Probability MethodTopic-specific Web Crawler using Probability Method
Topic-specific Web Crawler using Probability Method
 
Data.Mining.C.8(Ii).Web Mining 570802461
Data.Mining.C.8(Ii).Web Mining 570802461Data.Mining.C.8(Ii).Web Mining 570802461
Data.Mining.C.8(Ii).Web Mining 570802461
 
Transitivity of Trust
Transitivity of TrustTransitivity of Trust
Transitivity of Trust
 

Viewers also liked

Web Quest Of Operating Systems
Web Quest Of Operating SystemsWeb Quest Of Operating Systems
Web Quest Of Operating SystemsLAMMYY
 
CapitolJS: Enyo, Node.js, & the State of webOS
CapitolJS: Enyo, Node.js, & the State of webOSCapitolJS: Enyo, Node.js, & the State of webOS
CapitolJS: Enyo, Node.js, & the State of webOSBen Combee
 
What is touch screen ?
What is touch screen ?What is touch screen ?
What is touch screen ?1ow4
 
OSS組み込み部会セミナー20110906
OSS組み込み部会セミナー20110906OSS組み込み部会セミナー20110906
OSS組み込み部会セミナー20110906himamura (暇村)
 
Web ISO Documentation Management System
Web ISO Documentation Management SystemWeb ISO Documentation Management System
Web ISO Documentation Management SystemT7SR, ISOM
 
Web Os Cloud Presentation
Web Os Cloud PresentationWeb Os Cloud Presentation
Web Os Cloud PresentationDamian Hamilton
 
Hp web os overview
Hp web os overviewHp web os overview
Hp web os overviewPeter Park
 
What's New? What's Next? - Asia Pacific Data Digest by Elizabeth Winkle
What's New? What's Next? - Asia Pacific Data Digest by Elizabeth WinkleWhat's New? What's Next? - Asia Pacific Data Digest by Elizabeth Winkle
What's New? What's Next? - Asia Pacific Data Digest by Elizabeth WinkleSTR
 
Mobile embrace mindshare_boehringeringelheim_fresh_thinking_innovative_solutions
Mobile embrace mindshare_boehringeringelheim_fresh_thinking_innovative_solutionsMobile embrace mindshare_boehringeringelheim_fresh_thinking_innovative_solutions
Mobile embrace mindshare_boehringeringelheim_fresh_thinking_innovative_solutionsadtechanz
 
Harness the web and grow your business
Harness the web and grow your businessHarness the web and grow your business
Harness the web and grow your businessJoe Drumgoole
 
An Introduction to Soft Computing
An Introduction to Soft ComputingAn Introduction to Soft Computing
An Introduction to Soft ComputingTameem Ahmad
 
Web Operating System Overview
Web Operating System OverviewWeb Operating System Overview
Web Operating System OverviewMadhu Bala
 
Intro To webOS
Intro To webOSIntro To webOS
Intro To webOSfpatton
 
soft-computing
 soft-computing soft-computing
soft-computingstudent
 
Soft computing (ANN and Fuzzy Logic) : Dr. Purnima Pandit
Soft computing (ANN and Fuzzy Logic)  : Dr. Purnima PanditSoft computing (ANN and Fuzzy Logic)  : Dr. Purnima Pandit
Soft computing (ANN and Fuzzy Logic) : Dr. Purnima PanditPurnima Pandit
 
Introduction to soft computing
Introduction to soft computingIntroduction to soft computing
Introduction to soft computingAnkush Kumar
 

Viewers also liked (20)

Web Quest Of Operating Systems
Web Quest Of Operating SystemsWeb Quest Of Operating Systems
Web Quest Of Operating Systems
 
CapitolJS: Enyo, Node.js, & the State of webOS
CapitolJS: Enyo, Node.js, & the State of webOSCapitolJS: Enyo, Node.js, & the State of webOS
CapitolJS: Enyo, Node.js, & the State of webOS
 
What is touch screen ?
What is touch screen ?What is touch screen ?
What is touch screen ?
 
OSS組み込み部会セミナー20110906
OSS組み込み部会セミナー20110906OSS組み込み部会セミナー20110906
OSS組み込み部会セミナー20110906
 
Web ISO Documentation Management System
Web ISO Documentation Management SystemWeb ISO Documentation Management System
Web ISO Documentation Management System
 
Types of Virus & Anti-virus
Types of Virus & Anti-virusTypes of Virus & Anti-virus
Types of Virus & Anti-virus
 
Web Os Cloud Presentation
Web Os Cloud PresentationWeb Os Cloud Presentation
Web Os Cloud Presentation
 
Ota presentation
Ota presentationOta presentation
Ota presentation
 
Hp web os overview
Hp web os overviewHp web os overview
Hp web os overview
 
What's New? What's Next? - Asia Pacific Data Digest by Elizabeth Winkle
What's New? What's Next? - Asia Pacific Data Digest by Elizabeth WinkleWhat's New? What's Next? - Asia Pacific Data Digest by Elizabeth Winkle
What's New? What's Next? - Asia Pacific Data Digest by Elizabeth Winkle
 
Mobile embrace mindshare_boehringeringelheim_fresh_thinking_innovative_solutions
Mobile embrace mindshare_boehringeringelheim_fresh_thinking_innovative_solutionsMobile embrace mindshare_boehringeringelheim_fresh_thinking_innovative_solutions
Mobile embrace mindshare_boehringeringelheim_fresh_thinking_innovative_solutions
 
Harness the web and grow your business
Harness the web and grow your businessHarness the web and grow your business
Harness the web and grow your business
 
An Introduction to Soft Computing
An Introduction to Soft ComputingAn Introduction to Soft Computing
An Introduction to Soft Computing
 
Soft computing08
Soft computing08Soft computing08
Soft computing08
 
Web Operating System Overview
Web Operating System OverviewWeb Operating System Overview
Web Operating System Overview
 
Grid Computing
Grid ComputingGrid Computing
Grid Computing
 
Intro To webOS
Intro To webOSIntro To webOS
Intro To webOS
 
soft-computing
 soft-computing soft-computing
soft-computing
 
Soft computing (ANN and Fuzzy Logic) : Dr. Purnima Pandit
Soft computing (ANN and Fuzzy Logic)  : Dr. Purnima PanditSoft computing (ANN and Fuzzy Logic)  : Dr. Purnima Pandit
Soft computing (ANN and Fuzzy Logic) : Dr. Purnima Pandit
 
Introduction to soft computing
Introduction to soft computingIntroduction to soft computing
Introduction to soft computing
 

Similar to Page Rank Link Farm Detection

Done rerea dlink-farm-spam
Done rerea dlink-farm-spamDone rerea dlink-farm-spam
Done rerea dlink-farm-spamJames Arnold
 
Done rerea dlink-farm-spam(2)
Done rerea dlink-farm-spam(2)Done rerea dlink-farm-spam(2)
Done rerea dlink-farm-spam(2)James Arnold
 
Enhancement in Weighted PageRank Algorithm Using VOL
Enhancement in Weighted PageRank Algorithm Using VOLEnhancement in Weighted PageRank Algorithm Using VOL
Enhancement in Weighted PageRank Algorithm Using VOLIOSR Journals
 
Done rerea dlink spam alliances good
Done rerea dlink spam alliances goodDone rerea dlink spam alliances good
Done rerea dlink spam alliances goodJames Arnold
 
A Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTESA Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTESSubhajit Sahu
 
IRJET- Page Ranking Algorithms – A Comparison
IRJET- Page Ranking Algorithms – A ComparisonIRJET- Page Ranking Algorithms – A Comparison
IRJET- Page Ranking Algorithms – A ComparisonIRJET Journal
 
Incremental Page Rank Computation on Evolving Graphs : NOTES
Incremental Page Rank Computation on Evolving Graphs : NOTESIncremental Page Rank Computation on Evolving Graphs : NOTES
Incremental Page Rank Computation on Evolving Graphs : NOTESSubhajit Sahu
 
Search engine page rank demystification
Search engine page rank demystificationSearch engine page rank demystification
Search engine page rank demystificationRaja R
 
PageRank_algorithm_Nfaoui_El_Habib
PageRank_algorithm_Nfaoui_El_HabibPageRank_algorithm_Nfaoui_El_Habib
PageRank_algorithm_Nfaoui_El_HabibEl Habib NFAOUI
 
Random web surfer pagerank algorithm
Random web surfer pagerank algorithmRandom web surfer pagerank algorithm
Random web surfer pagerank algorithmalexandrelevada
 
PageRank & Searching
PageRank & SearchingPageRank & Searching
PageRank & Searchingrahulbindra
 
PageRank Algorithm In data mining
PageRank Algorithm In data miningPageRank Algorithm In data mining
PageRank Algorithm In data miningMai Mustafa
 
LINEAR ALGEBRA BEHIND GOOGLE SEARCH
LINEAR ALGEBRA BEHIND GOOGLE SEARCHLINEAR ALGEBRA BEHIND GOOGLE SEARCH
LINEAR ALGEBRA BEHIND GOOGLE SEARCHDivyansh Verma
 

Similar to Page Rank Link Farm Detection (20)

TrustRank.PDF
TrustRank.PDFTrustRank.PDF
TrustRank.PDF
 
Done rerea dlink-farm-spam
Done rerea dlink-farm-spamDone rerea dlink-farm-spam
Done rerea dlink-farm-spam
 
Done rerea dlink-farm-spam(2)
Done rerea dlink-farm-spam(2)Done rerea dlink-farm-spam(2)
Done rerea dlink-farm-spam(2)
 
PageRank Algorithm
PageRank AlgorithmPageRank Algorithm
PageRank Algorithm
 
Enhancement in Weighted PageRank Algorithm Using VOL
Enhancement in Weighted PageRank Algorithm Using VOLEnhancement in Weighted PageRank Algorithm Using VOL
Enhancement in Weighted PageRank Algorithm Using VOL
 
Done rerea dlink spam alliances good
Done rerea dlink spam alliances goodDone rerea dlink spam alliances good
Done rerea dlink spam alliances good
 
A Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTESA Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTES
 
IRJET- Page Ranking Algorithms – A Comparison
IRJET- Page Ranking Algorithms – A ComparisonIRJET- Page Ranking Algorithms – A Comparison
IRJET- Page Ranking Algorithms – A Comparison
 
Incremental Page Rank Computation on Evolving Graphs : NOTES
Incremental Page Rank Computation on Evolving Graphs : NOTESIncremental Page Rank Computation on Evolving Graphs : NOTES
Incremental Page Rank Computation on Evolving Graphs : NOTES
 
Page Rank
Page RankPage Rank
Page Rank
 
Search engine page rank demystification
Search engine page rank demystificationSearch engine page rank demystification
Search engine page rank demystification
 
PageRank_algorithm_Nfaoui_El_Habib
PageRank_algorithm_Nfaoui_El_HabibPageRank_algorithm_Nfaoui_El_Habib
PageRank_algorithm_Nfaoui_El_Habib
 
Page Rank
Page RankPage Rank
Page Rank
 
Page Rank
Page RankPage Rank
Page Rank
 
Random web surfer pagerank algorithm
Random web surfer pagerank algorithmRandom web surfer pagerank algorithm
Random web surfer pagerank algorithm
 
PageRank & Searching
PageRank & SearchingPageRank & Searching
PageRank & Searching
 
PageRank Algorithm In data mining
PageRank Algorithm In data miningPageRank Algorithm In data mining
PageRank Algorithm In data mining
 
LINEAR ALGEBRA BEHIND GOOGLE SEARCH
LINEAR ALGEBRA BEHIND GOOGLE SEARCHLINEAR ALGEBRA BEHIND GOOGLE SEARCH
LINEAR ALGEBRA BEHIND GOOGLE SEARCH
 
Page Rank
Page RankPage Rank
Page Rank
 
Page Rank
Page RankPage Rank
Page Rank
 

More from International Journal of Engineering Inventions www.ijeijournal.com

More from International Journal of Engineering Inventions www.ijeijournal.com (20)

H04124548
H04124548H04124548
H04124548
 
G04123844
G04123844G04123844
G04123844
 
F04123137
F04123137F04123137
F04123137
 
E04122330
E04122330E04122330
E04122330
 
C04121115
C04121115C04121115
C04121115
 
B04120610
B04120610B04120610
B04120610
 
A04120105
A04120105A04120105
A04120105
 
F04113640
F04113640F04113640
F04113640
 
E04112135
E04112135E04112135
E04112135
 
D04111520
D04111520D04111520
D04111520
 
C04111114
C04111114C04111114
C04111114
 
B04110710
B04110710B04110710
B04110710
 
A04110106
A04110106A04110106
A04110106
 
I04105358
I04105358I04105358
I04105358
 
H04104952
H04104952H04104952
H04104952
 
G04103948
G04103948G04103948
G04103948
 
F04103138
F04103138F04103138
F04103138
 
E04102330
E04102330E04102330
E04102330
 
D04101822
D04101822D04101822
D04101822
 
C04101217
C04101217C04101217
C04101217
 

Recently uploaded

Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxsiddharthjain2303
 
CS 3251 Programming in c all unit notes pdf
CS 3251 Programming in c all unit notes pdfCS 3251 Programming in c all unit notes pdf
CS 3251 Programming in c all unit notes pdfBalamuruganV28
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...Erbil Polytechnic University
 
OOP concepts -in-Python programming language
OOP concepts -in-Python programming languageOOP concepts -in-Python programming language
OOP concepts -in-Python programming languageSmritiSharma901052
 
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.ppt
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.pptROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.ppt
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.pptJohnWilliam111370
 
Novel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsNovel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsResearcher Researcher
 
Turn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxTurn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxStephen Sitton
 
signals in triangulation .. ...Surveying
signals in triangulation .. ...Surveyingsignals in triangulation .. ...Surveying
signals in triangulation .. ...Surveyingsapna80328
 
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESCME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESkarthi keyan
 
Prach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism CommunityPrach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism Communityprachaibot
 
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.elesangwon
 
Artificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewArtificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewsandhya757531
 
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSHigh Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSsandhya757531
 
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Sumanth A
 
List of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdfList of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdfisabel213075
 
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENTFUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENTSneha Padhiar
 
Levelling - Rise and fall - Height of instrument method
Levelling - Rise and fall - Height of instrument methodLevelling - Rise and fall - Height of instrument method
Levelling - Rise and fall - Height of instrument methodManicka Mamallan Andavar
 
Module-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdfModule-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdfManish Kumar
 
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...Stork
 
Python Programming for basic beginners.pptx
Python Programming for basic beginners.pptxPython Programming for basic beginners.pptx
Python Programming for basic beginners.pptxmohitesoham12
 

Recently uploaded (20)

Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptx
 
CS 3251 Programming in c all unit notes pdf
CS 3251 Programming in c all unit notes pdfCS 3251 Programming in c all unit notes pdf
CS 3251 Programming in c all unit notes pdf
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...
 
OOP concepts -in-Python programming language
OOP concepts -in-Python programming languageOOP concepts -in-Python programming language
OOP concepts -in-Python programming language
 
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.ppt
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.pptROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.ppt
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.ppt
 
Novel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsNovel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending Actuators
 
Turn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxTurn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptx
 
signals in triangulation .. ...Surveying
signals in triangulation .. ...Surveyingsignals in triangulation .. ...Surveying
signals in triangulation .. ...Surveying
 
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESCME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
 
Prach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism CommunityPrach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism Community
 
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
 
Artificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewArtificial Intelligence in Power System overview
Artificial Intelligence in Power System overview
 
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSHigh Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
 
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
 
List of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdfList of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdf
 
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENTFUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
 
Levelling - Rise and fall - Height of instrument method
Levelling - Rise and fall - Height of instrument methodLevelling - Rise and fall - Height of instrument method
Levelling - Rise and fall - Height of instrument method
 
Module-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdfModule-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdf
 
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
 
Python Programming for basic beginners.pptx
Python Programming for basic beginners.pptxPython Programming for basic beginners.pptx
Python Programming for basic beginners.pptx
 

Page Rank Link Farm Detection

  • 1. International Journal of Engineering Inventions e-ISSN: 2278-7461, p-ISSN: 2319-6491 Volume 4, Issue 1 (July 2014) PP: 55-59 www.ijeijournal.com Page | 55 Page Rank Link Farm Detection Akshay Saxena1 , Rohit Nigam2 1, 2 Department Of Information and Communication Technology (ICT) Manipal University, Manipal - 576014, Karnataka, India Abstract: The PageRank algorithm is an important algorithm which is implemented to determine the quality of a page on the web. With search engines attaining a high position in guiding the traffic on the internet, PageRank is an important factor to determine its flow. Since link analysis is used in search engine’s ranking systems, link based spam structure known as link farms are created by spammers to generate a high PageRank for their and in turn a target page. In this paper, we suggest a method through which these structures can be detected and thus the overall ranking results can be improved. I. Introduction 1.1 Page Rank We will be working primarily on PageRank[1] algorithm developed by Larry Page of Google. This algorithm determines the importance of a website by counting the number and quality of links to a page, assuming that an important website will have more links pointing to it by other websites. This technique is used in Google's search process to rank websites of a search result. It is not the only algorithm used by Google to order search engine results, but it is the first algorithm that was used by the company, and it is the best-known. The counting and indexing of Web pages is done through various programs like webcrawlers[1] and other bot programs. By assigning a numerical weighting to each link, it repeatedly applies the PageRank algorithm to get a more accurate result and determine the relative rank of a Web page within a set pages. The set of Web pages being considered is thought of as a directed graph with each node representing a page and each edge as a link between pages. We determine the number of links pointing to a page by traversing the graph and use those values in the PageRank formula. Pagerank is quite prone to manipulation. Our goal was to find an effective way to ignore links from pages that are trying to falsify a Pagerank. These spamming pages usually occur as a group called link farms described below. 1.2 Complete Graph In the mathematical field of graph theory, a complete graph[2] is a simple undirected graph in which every pair of distinct vertices is connected by a unique edge. A complete digraph is a directed graph in which every pair of distinct vertices is connected by a pair of unique edges (one in each direction). 1.3 Clique In the mathematical area of graph theory, a clique[2] an undirected graph is a subset of its vertices such that every two vertices in the subset are connected by an edge. Cliques are one of the basic concepts of graph theory and are used in many other mathematical problems and constructions on graphs. Cliques have also been studied in computer science: the task of finding whether there is a clique of a given size in a graph (the clique problem) is NP-complete, but despite this hardness result many algorithms for finding cliques have been studied. FIG 1.1
  • 2. Page Rank Link Farm Detection www.ijeijournal.com Page | 56 1.4 Link Farm On the World Wide Web, a link farm[3] is any group of web sites that all hyperlink to every other site in the group. In graph theoretic terms, a link farm is a clique. Although some link farms can be created by hand, most are created through automated programs and services. A link farm is a form of spamming the index of a search engine (sometimes called spamdexing or spamexing). Other link exchange systems are designed to allow individual websites to selectively exchange links with other relevant websites and are not considered a form of spamdexing. FIG 1.2 Normal Graph of Web Pages FIG 1.3 Link Farm of 1, 2, 3 & 4 II. Existing System Academic citation literature has been applied to the web, largely by counting citations or backlinks to a given page. This gives some approximation of a page's importance or quality. PageRank extends this idea by not counting links from all pages equally, and by normalizing by the number of links on a page. PageRank is defined as follows: We assume page A has pages T1...Tn which point to it (i.e., are inbound links). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows: PR(A) = (1-d)/N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) Note that the PageRank values form a probability distribution over web pages, so the sum of all web pages' PageRank values will be one. PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. Also, a PageRank for 26 million web pages can be computed in a few hours on a medium size workstation III. Previous Work Link spamming is one of the web spamming techniques that try to mislead link-based ranking algorithms such as PageRank and HITS. Since these algorithms consider a link to pages as an endorsement for that page, spammers create numerous false links and construct an artificially interlinked link structure, so called a spam farm, to centralize link-based importance to their own spam pages. To understand the web spamming, Gyöngyi et al[4] described various web spamming techniques in. Optimal link structures to boost PageRank scores are also studied to grasp the behavior of web spammers. Fetterly[5] found out that outliers in statistical distributions are very likely to be spam by analyzing statistical properties of linkage, URL, host resolutions and contents of pages. To demote link spam, Gyöngyi et alintroduced TrustRank[4] that is a biased PageRank where rank scores start to be propagated from a seed set of good pages through outgoing links. By this, we can expect spam pages to get low rank. Optimizing the link structure is another approach to demote link spam. Carvalho[6] proposed the idea of noisy links, a link structure that has a negative impact on link-based ranking algorithms. Qi et al. also estimated the quality of links by similarity of two pages. To detect link spam, Bencz´ur introduced SpamRank[7] . SpamRank checks PageRank score distributions of all in-neighbors of a target page. If this distribution is abnormal, SpamRank regards a target page as a spam and penalizes it. Becchetti[8] employed link-based features for the link spam detection. They built a link spam classifier with several features of the link structure like degrees, link-based ranking scores, and characteristics of out-
  • 3. Page Rank Link Farm Detection www.ijeijournal.com Page | 57 neighbors. Saito[9] employed a graph algorithm to detect link spam. They decomposed the Web graph into strongly connected components and discovered that large components are spam with high probability. Link farms in the core were extracted by maximal clique enumeration. IV. Proposed System The PageRank algorithm, as described in the previous section, works upon a given set of pages found via search using Web spiders or other programs. Now to rank these pages, we consider the links of each page and apply the PageRank algorithm to it. However, the page rank algorithm can be easily fooled by link farms. Link Farms work as follows - We know that the PageRank is higher for pages with more links pointing to it i.e. more the in-line, higher the rank. A link farm forms a complete graph. This means, we have a large set of pages pointing to each other, thus increasing the number of in-links and out-links even they are not relevant, hence they falsify the Page Rank values obtained. On applying the algorithm to this graph, an interesting fact came to notice that the Page Rank values of all the pages in a link farm was the same and remained so even after multiple iterations of the algorithm. One could easily use this fact to identify link farms and remove such sets of pages from our whole set. But the problem occurs when a page which is not a part of the link farm and is actually relevant, it happens to have the same or nearly same PageRank as that of the link farm pages. So if we consider pages with same page ranks for elimination, some relevant pages might also get discarded unintentionally. To solve this problem, we introduce a new factor we call Gap Rank. GapRank is based on PageRank and in a way inverse of it. While PageRank is based on inbound links of a page, GapRank is calculated using outbound links of pages. Gaprank Formula GR(A) = (1-d)/N + d (GR(T1)/L(T1) + ... + GR(Tn)/L(Tn)) where A is the page under consideration, T1,T2….Tn the set of pages that link from A i.e. outbound pages,L(Tn) is the number of inbound links on page Tn, and N is the total number of pages. The purpose of this formula is to rank the set of pages based on the number of outbound links. We then use the GapRank to identify the Link Farm. Since all the pages in a link farm have same number of inbound links and outbound links, the GapRank of these pages will be equal, just like their PageRanks. This can now be used to differentiate a link farm from a page having almost same PageRank as those of link farm pages, but it's GapRank will be different. This analysis can further be used to reduce the ranks of these spam pages or eliminate them altogether so that we get proper ranks for pages that suffered because of the link farm. V. Methodology Consider a given dataset consisting of 11 pages P1, P2,..,P11. The links between the pages can be seen in the FIG 5.1, depicting a graph of nodes for each page and edges for links between them. [FIG 5.1 Graph of initial dataset]
  • 4. Page Rank Link Farm Detection www.ijeijournal.com Page | 58 5.1 Calculating Pagerank Initial PageRank to be given to all pages will be 1/n, where n is the size of dataset, i.e., 1/11. Set the initial damping factor. In our example we take damping factor as 0.85. Calculate the PageRank values of all the pages according to the formula : PR(A) = (1-d)/N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) where PR represents the PageRank of the represented page, C represents the total number of out bound links of the page and d represents the damping factor. For the given dataset, we obtain the following PageRank values after 30 iterations (Table 5.1.1): Page Name PageRank P1 0.0557838 P2 0.0426136 P3 0.0426136 P4 0.0426136 P5 0.0426136 P6 0.0426136 P7 0.0194318 P8 0.0359489 P9 0.0136364 P10 0.0136364 P11 0.049858 [Table 5.1.1 PageRank values of all pages in the dataset.] We consider the PageRank values obtained after 30 iterations of PageRank calculation for all pages as final because variations in the values after this will be negligible. Negligible changes reflect final values allotted to each page. 5.2 Searching For Duplicate Pagerank Values Scan the result for all the duplicate PageRank values using the best scanning technique available. The pages with duplicate values are shown in Table 5.2.1. These pages may or may not belong to the spam group (link farm) as some pages may have the same value as a spam page just by mere coincidence. To eliminate these pages from the suspected pages, we perform the next step. Page Name PageRank P2 0.0426136 P3 0.0426136 P4 0.0426136 P5 0.0426136 P6 0.0426136 [Table 5.2.1 Pages having duplicate values] 5.3 Calculating Gaprank Values Gap Rank values for the selected pages are calculated in the same way as PageRank values, by using the formula shown in fig 4.1. The Gap Rank values obtained are shown in the Table 5.3.1 Page Name GapRank P2 0.106365 P3 0.106365 P4 0.106365 P5 0.106365 P6 0.106365 [Table 5.2.1 GapRank values of pages having duplicate PageRank values ] The pages having the same Gap Rank and PageRank values are identified as spam pages which constitute the link farm.
  • 5. Page Rank Link Farm Detection www.ijeijournal.com Page | 59 5.4 Remove Link Farms From The Dataset The pages constituting the link farm are removed from the data set and a new dataset is obtained without the spam pages. The PageRank values of the pages in the new dataset are calculated from the beginning with new initial PageRank values for all pages determined by 1/n’, where n’ is the new dataset size. The new PageRank values are shown in the table 5.4.1 given below. Page Name PageRank P1 0.122724 P7 0.04275 P8 0.0790875 P9 0.03 P10 0.03 [Table5.4.1 New PageRank Values for non-spam pages] REFERENCES [1]. Sergey Brin and Lawrence Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine [2]. Richard D. Alba: A graph-theoretic definition of a Sociometric Clique [3]. Baoning Wu and Brian D. Davison:Identifying Link Farm Spam Pages [4]. Z. Gy¨ongyi and H. Garcia-Molina. Web spamtaxonomy. In Proceedings of the 1st internationalworkshop on Adversarial information retrieval on theWeb, 2005. [5]. D. Fetterly, M. Manasse and M. Najork. Spam, damnspam, and statistics: using statistical analysis tolocate spam Web pages. In Proceedings of the 7th International Workshop on the Web and Databases,2004. [6]. A. Carvalho, P. Chirita, E. Moura and P. Calado. Sitelevel noise removal for search engines. In Proceedingsof the 15th international conference on World WideWeb. 2006 [7]. A. A. Bencz´ur, K Csalog´any, T Sarl´os and M. Uher.SpamRank-fully automatic link spam detection. InProceedings of the 1st international workshop onAdversarial information retrieval on the Web, 2005 [8]. L. Becchetti, C. Castillo, D. Donato, S. Leonardi andR. Baeza-Yates. Link-based characterization anddetection of Web spam. In Proceedings of the 2nd international workshop on Adversarial informationretrieval on the Web, 2006. [9]. H. Saito, M. Toyoda, M. Kitsuregawa and K. Aihara: A large-scale study of link spam detection by graphalgorithms In Proceedings of the 3rd internationalworkshop on Adversarial information retrieval on theWeb, 2007.